| 1 | +# Synthesize tabular data |
| 2 | + |
| 3 | +**Using *CTGAN* to generate tabular synthetic data:** |
| 4 | + |
| 5 | +Real-world domains are often described by **tabular data**, i.e., data that can be structured and organized in a table-like format, where **features/variables** are represented in **columns** and **observations** correspond to the **rows**. |
| 6 | + |
| 7 | +Additionally, real-world data usually comprises both **numeric** and **categorical** features. Numeric features encode quantitative values, whereas categorical features represent qualitative ones. |
| 8 | + |
| 9 | +CTGAN was specifically designed to deal with the challenges posed by tabular datasets, handling mixed (numeric and categorical) data: |
| 10 | + |
| 11 | +- 📑 **Paper:** [Modeling Tabular Data using Conditional GAN](https://arxiv.org/pdf/1907.00503.pdf) |
| 12 | + |
| 13 | +Here’s an example of how to synthesize tabular data with CTGAN using the [Adult Census Income](https://www.kaggle.com/datasets/uciml/adult-census-income?resource=download) dataset: |
| 14 | + |
| 15 | +```python |
| 16 | +--8<-- "examples/regular/models/adult_ctgan.py" |
| 17 | +``` |
| 18 | + |
| 19 | +## Best practices & results optimization |
| 20 | + |
| 21 | +!!! tip "Generate the best synthetic data quality" |
| 22 | + |
| 23 | + If CTGAN does not deliver the synthetic data quality that you need for your use case, |
| 24 | + give [YData Fabric Synthetic Data](https://ydata.ai/register) a try. |
| 25 | + **Fabric Synthetic Data generation** is considered the best in terms of quality. |
| 26 | + [Read more about it in this benchmark](https://www.linkedin.com/pulse/generative-ai-synthetic-data-vendor-comparison-best-vincent-granville). |
| 27 | + |
| 28 | +**CTGAN**, like any other Machine Learning model, requires careful data preparation as well as |
| 29 | +hyperparameter tuning. The following best practices and tips can help improve your synthetic data quality: |
| 30 | + |
| 31 | +- **Understand Your Data:** |
| 32 | +Thoroughly understand the characteristics and distribution of your original dataset before using CTGAN. |
| 33 | +Identify important features, correlations, and patterns in the data. |
| 34 | +Use [ydata-profiling](https://pypi.org/project/ydata-profiling/) to automate the process of understanding your data. |
| 35 | + |
| 36 | +- **Preprocess Your Data:** |
| 37 | +Clean and preprocess your data to handle missing values, outliers, and other anomalies before training CTGAN. |
| 38 | +Standardize or normalize numerical features to ensure consistent scales. |
| 39 | + |
| 40 | +- **Feature Engineering:** |
| 41 | +Create additional meaningful features that could improve the quality of the synthetic data. |
| 42 | + |
| 43 | +- **Optimize Model Parameters:** |
| 44 | +Experiment with CTGAN hyperparameters such as *epochs*, *batch_size*, and *gen_dim* to find the values that work best |
| 45 | +for your specific dataset. |
| 46 | +Fine-tune the *learning rate* for better convergence. |
| 47 | + |
| 48 | +- **Conditional Generation:** |
| 49 | +Leverage the conditional generation capabilities of CTGAN by specifying conditions for certain features if applicable. |
| 50 | +Adjust the conditioning mechanism to enhance the relevance of generated samples. |
| 51 | + |
| 52 | +- **Handle Imbalanced Data:** |
| 53 | +If your original dataset is imbalanced, ensure that CTGAN captures the distribution of minority classes effectively. |
| 54 | +Adjust sampling strategies if needed. |
| 55 | + |
| 56 | +- **Use Larger Datasets:** |
| 57 | +Train CTGAN on larger datasets when possible to capture a more comprehensive representation of the underlying data distribution. |
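
The "Handle Imbalanced Data" tip can be checked with a simple comparison of category frequencies between the original and the synthetic data. The block below is a minimal sketch of that idea, not part of the CTGAN or ydata-synthetic API: the helper names `category_shares` and `total_variation_distance` are hypothetical, and the toy `income` values are made up for illustration. It only assumes that each column is available as a plain sequence of labels (for example, a pandas column converted to a list).

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def category_shares(values: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Return the relative frequency of each category in a column."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("column is empty")
    return {category: n / total for category, n in counts.items()}


def total_variation_distance(
    real: Iterable[Hashable], synthetic: Iterable[Hashable]
) -> float:
    """Half the sum of absolute share differences.

    0.0 means both columns have identical category shares,
    1.0 means they share no category at all.
    """
    real_shares = category_shares(real)
    synth_shares = category_shares(synthetic)
    categories = set(real_shares) | set(synth_shares)
    return 0.5 * sum(
        abs(real_shares.get(c, 0.0) - synth_shares.get(c, 0.0))
        for c in categories
    )


# Toy columns standing in for a real and a synthetic "income" feature
# (hypothetical values, not results from an actual CTGAN run).
real_income = ["<=50K"] * 76 + [">50K"] * 24
synthetic_income = ["<=50K"] * 85 + [">50K"] * 15

print(category_shares(synthetic_income))
print(total_variation_distance(real_income, synthetic_income))
```

A distance close to 0 suggests the synthetic column reproduces the class balance of the original one. A large value for a column with a minority class is a hint to revisit the sampling strategy or the training settings.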