Commit 2c5d22a

fabclmnt and Fabiana Clemente authored
docs: update docs with GMM and relational database synthesis. (#313)
Co-authored-by: Fabiana Clemente <fabianaclemente@Fabianas-MacBook-Air.local>
1 parent 1279e5e commit 2c5d22a

17 files changed

Lines changed: 238 additions & 56 deletions

docs/examples/ctgan_example.md

Lines changed: 0 additions & 18 deletions
This file was deleted.

docs/stylesheets/extra.css

Lines changed: 30 additions & 1 deletion
@@ -5,6 +5,17 @@
   transform: none;
 }
 
+.md-content {
+  --md-typeset-a-color: #002b9e;
+}
+
+@media {
+  .md-button--ydata {
+    --md-primary-fg-color: #E32212;
+    --md-primary-bg-color: #E32212;
+  }
+}
+
 :root {
   /* Primary color shades */
   --md-primary-fg-color: #040404;
@@ -19,4 +30,22 @@
   --md-accent-fg-color--transparent: hsla(189, 100%, 37%, 0.1);
   --md-accent-bg-color: hsla(0, 0%, 100%, 1);
   --md-accent-bg-color--light: hsla(0, 0%, 100%, 0.7);
-}
+}
+
+:root > * {
+  /* Code block color shades */
+  --md-code-bg-color: hsla(0, 0%, 96%, 1);
+  --md-code-fg-color: hsla(200, 18%, 26%, 1);
+
+  /* Footer */
+  --md-footer-bg-color: #040404;
+  --md-footer-bg-color--dark: hsla(0, 0%, 0%, 0.32);
+  --md-footer-fg-color: hsla(0, 0%, 100%, 1);
+  --md-footer-fg-color--light: hsla(0, 0%, 100%, 0.7);
+  --md-footer-fg-color--lighter: hsla(0, 0%, 100%, 0.3);
+}
+
+.youtube {
+  color: #EE0F0F;
+}
+
File renamed without changes.

docs/synthetic_data/index.md

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+# Synthetic data generation
+
+[Synthetic data](https://ydata.ai/products/synthetic_data) is data created artificially, through computer simulation or generative algorithms, to
+take the place of real-world data. It can be used as an alternative or supplement to real-world data when real-world
+data is not readily available. It can also be used to boost Machine Learning performance.
+
+The ydata-synthetic package is an open-source Python package developed by YData’s team that allows users to experiment
+with several generative models for synthetic data generation. The main goal of the package is to help data
+scientists get familiar with synthetic data and its applications in real-world domains, as well as the potential of **Generative AI**.
+
+The *ydata-synthetic* package provides different methods for generating synthetic tabular and time-series data,
+such as Variational Auto Encoders (VAE), [Gaussian Mixture Models (GMM)](single_table/gmm_example.md), and [Conditional Tabular Generative Adversarial Networks (CTGAN)](single_table/ctgan_example.md).
+The package also includes a user-friendly UI that guides users through the steps and inputs needed to generate synthetic data
+samples.
+
+The package also aims to facilitate the exploration and understanding of synthetic data generation methods and their limitations.
+
+### 📄 <a href="single_table/ctgan_example.md"><u>Get started with synthetic data for tabular data with CTGAN</u></a>
+### 📈 <a href="time_series/timegan_example.md"><u>Get started with synthetic data for time-series with TimeGAN</u></a>
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+# Multiple tables synthetic data generation **
+
+!!! info "** YData's Enterprise feature"
+
+    This feature is only available for users of [YData Fabric](https://ydata.ai).
+
+    [Sign up for the Fabric community](http://ydata.ai/register?utm_source=ydata-profiling&utm_medium=documentation&utm_campaign=YData%20Fabric%20Community) and
+    try synthetic data generation from multiple tables, or [contact us](https://ydata.ai/contact-us) for more information.
+
+Multi-table synthetic data enables the creation of large, diverse
+datasets crucial for training robust machine learning models, algorithm testing, and addressing privacy concerns. It can also be
+crucial for enabling proper data democratization within an organization.
+
+Nevertheless, generating a full database, or even several tables that share relations, can be particularly
+challenging due to the need to preserve referential integrity across diverse tables and at scale. This involves maintaining
+realistic relationships between entities to mirror real-world scenarios accurately while being able to process large volumes
+of data.
+
+[YData Fabric](https://ydata.ai/products/fabric) offers a cutting-edge synthetic data generation process that seamlessly integrates with your existing relational databases.
+By replicating the data's value and structure to a new target storage, Fabric delivers a wide range of benefits and use cases.
+These include reducing risk and improving compliance by substituting operational databases with synthetic databases for testing and development. It also enables QA teams to create more comprehensive and flexible testing scenarios.
+
+Explore [Fabric](https://ydata.ai/register) multi-table synthesis capabilities:
+
+### From what sources am I able to train a multi-table synthetic data generator?
+- From a relational database
+- From the upload of multiple files
+
+### Related materials
+- 📖 <a href="https://ydata.ai/resources/whitepaper-relational-databases-synthetic-data"><u>Read more about Fabric's multi-table synthesis process in this whitepaper</u></a>
+- :fontawesome-brands-youtube:{ .youtube } <a href="https://www.youtube.com/watch?v=9EupCg5YQLE&t=130s"><u>See Fabric multi-table synthesis in action</u></a>
File renamed without changes.
File renamed without changes.
Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
+# Synthesize tabular data
+
+**Using *CTGAN* to generate tabular synthetic data:**
+
+Real-world domains are often described by **tabular data**, i.e., data that can be structured and organized in a table-like format, where **features/variables** are represented in **columns**, whereas **observations** correspond to the **rows**.
+
+Additionally, real-world data usually comprises both **numeric** and **categorical** features. Numeric features are those that encode quantitative values, whereas categorical features represent qualitative measurements.
+
+CTGAN was specifically designed to deal with the challenges posed by tabular datasets, handling mixed (numeric and categorical) data:
+
+- 📑 **Paper:** [Modeling Tabular Data using Conditional GAN](https://arxiv.org/pdf/1907.00503.pdf)
+
+Here’s an example of how to synthesize tabular data with CTGAN using the [Adult Census Income](https://www.kaggle.com/datasets/uciml/adult-census-income?resource=download) dataset:
+
+```python
+--8<-- "examples/regular/models/adult_ctgan.py"
+```
+
+## Best practices & results optimization
+
+!!! tip "Generate the best synthetic data quality"
+
+    If you are having a hard time getting CTGAN to deliver the synthetic data quality that you need for your use case,
+    give [YData Fabric Synthetic Data](https://ydata.ai/register) a try.
+    **Fabric Synthetic Data generation** is considered the best in terms of quality.
+    [Read more about it in this benchmark](https://www.linkedin.com/pulse/generative-ai-synthetic-data-vendor-comparison-best-vincent-granville).
+
+**CTGAN**, like any other Machine Learning model, requires optimization at the level of data preparation as well as
+hyperparameter tuning. Below is a list of best practices and tips to improve your synthetic data quality:
+
+- **Understand Your Data:**
+Thoroughly understand the characteristics and distribution of your original dataset before using CTGAN.
+Identify important features, correlations, and patterns in the data.
+Use [ydata-profiling](https://pypi.org/project/ydata-profiling/) to automate the process of understanding your data.
+
+- **Preprocess Your Data:**
+Clean and preprocess your data to handle missing values, outliers, and other anomalies before training CTGAN.
+Standardize or normalize numerical features to ensure consistent scales.
+
+- **Feature Engineering:**
+Create additional meaningful features that could improve the quality of the synthetic data.
+
+- **Optimize Model Parameters:**
+Experiment with CTGAN hyperparameters such as *epochs*, *batch_size*, and *gen_dim* to find the values that work best
+for your specific dataset.
+Fine-tune the *learning rate* for better convergence.
+
+- **Conditional Generation:**
+Leverage the conditional generation capabilities of CTGAN by specifying conditions for certain features if applicable.
+Adjust the conditioning mechanism to enhance the relevance of generated samples.
+
+- **Handle Imbalanced Data:**
+If your original dataset is imbalanced, ensure that CTGAN captures the distribution of minority classes effectively.
+Adjust sampling strategies if needed.
+
+- **Use Larger Datasets:**
+Train CTGAN on larger datasets when possible to capture a more comprehensive representation of the underlying data distribution.
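
Editorial sketch (not part of this commit and not a ydata-synthetic API): the "Handle Imbalanced Data" tip above can be checked by comparing category shares in the real and synthetic versions of a column. The helper names `category_shares` and `total_variation_distance`, and the toy income labels, are assumptions made only for this illustration.

```python
from collections import Counter


def category_shares(values):
    """Return the relative frequency of each category in `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(real_values, synthetic_values):
    """Half the sum of absolute share differences.

    0.0 means the two columns have identical category shares;
    1.0 means they share no categories at all.
    """
    real = category_shares(real_values)
    synthetic = category_shares(synthetic_values)
    categories = set(real) | set(synthetic)
    return 0.5 * sum(
        abs(real.get(c, 0.0) - synthetic.get(c, 0.0)) for c in categories
    )


# Toy labels standing in for an imbalanced target column (made-up values).
real_income = ["<=50K"] * 90 + [">50K"] * 10
synthetic_income = ["<=50K"] * 97 + [">50K"] * 3

real_shares = category_shares(real_income)
synthetic_shares = category_shares(synthetic_income)
distance = total_variation_distance(real_income, synthetic_income)

print("real:", real_shares)
print("synthetic:", synthetic_shares)
print("total variation distance:", round(distance, 3))
```

In this toy case the minority class drops from a 10% share to a 3% share, which is the kind of gap that might prompt a change in sampling strategy or training settings. What counts as an acceptable distance depends on the use case and is left to the reader.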
File renamed without changes.
File renamed without changes.
