|
| 1 | +<p></p> |
| 2 | +<p align="center"><img width="300" src="https://assets.ydata.ai/oss/ydata-synthetic_black.png" alt="YData Synthetic Logo"></p> |
| 3 | +<p></p> |
| 4 | + |
| 5 | +[](https://pypi.org/project/ydata-synthetic) |
| 6 | + |
| 7 | +[](https://pepy.tech/project/ydata-synthetic) |
| 8 | + |
| 9 | + |
| 10 | +[](https://github.com/ydataai/ydata-synthetic/actions/workflows/tests.yml) |
| 11 | +[](https://codecov.io/gh/ydataai/ydata-synthetic) |
| 12 | +[](https://github.com/ydataai/ydata-synthetic) |
| 13 | +[](https://discord.com/invite/mw7xjJ7b7s) |
| 14 | + |
| 15 | +## Overview |
| 16 | +`YData-Synthetic` is a pioneering open-source package developed in 2020 with the primary goal of educating users about generative models for synthetic data generation. |
| 17 | +Designed as a collection of models, it was intended for exploratory studies and educational purposes. |
| 18 | +However, it was not optimized for the quality, performance, and scalability needs typically required by organizations. |
| 19 | + |
| 20 | +!!! tip "We are now ydata-sdk!" |
| 21 | + The journey has been fun, and we have learned a lot from the community, but it is now time to upgrade `ydata-synthetic`. |
| 22 | + |
| 23 | + Looking to the future of synthetic data generation, we recommend that users transition to `ydata-sdk`, which provides a superior experience with enhanced performance, |
| 24 | + precision, and ease of use, making it the preferred tool for synthetic data generation and a perfect introduction to Generative AI. |
| 25 | + |
| 26 | +## Supported Data Types |
| 27 | + |
| 28 | +=== "Tabular Data" |
| 29 | + **Tabular data** does not have a temporal dependence, and can be structured and organized in a table-like format, where **features are represented in columns**, whereas **observations correspond to the rows**. |
| 30 | + |
| 31 | + Additionally, tabular data usually comprises both *numeric* and *categorical* features. **Numeric** features are those that encode **quantitative** values, whereas **categorical** features represent **qualitative** measurements. Categorical features can be further divided into *ordinal*, *binary* or *boolean*, and *nominal* features. |
| 32 | + |
| 33 | + Learn more about synthesizing tabular data in this [article](https://ydata.ai/resources/gans-for-synthetic-data-generation), or check the [quickstart guide](getting-started/quickstart.md#synthesizing-a-tabular-dataset) to get started with synthesizing tabular datasets. |
| 34 | + |
| 35 | +=== "Time-Series Data" |
| 36 | + **Time-series data** exhibit a sequential, **temporal dependency** between records, and may present a wide range of patterns and trends, including **seasonality** (patterns that repeat at calendar periods -- days, weeks, months -- such as holiday sales, for instance) or **periodicity** (patterns that repeat over time). |
| 37 | + |
| 38 | + Read more about generating [time-series data in this article](https://ydata.ai/resources/synthetic-time-series-data-a-gan-approach) and check this [quickstart guide](getting-started/quickstart.md#synthesizing-a-time-series-dataset) to get started with time-series data synthesis. |
| 39 | + |
| 40 | +=== "Multi-Table Data" |
| 41 | + **Multi-Table data**, or databases, exhibit referential relationships between tables and a database schema that the generated synthetic data is expected to replicate and respect. |
| 42 | + Read more about database [synthetic data generation in this article]() and check this [quickstart guide for Multi-Table synthetic data generation](). |
| 44 | + |
| 45 | +## Validate the quality of your generated synthetic data |
| 46 | + |
| 47 | +Validating the quality of synthetic data is essential to ensure its usefulness and privacy. YData Fabric provides tools for comprehensive synthetic data evaluation through: |
| 48 | + |
| 49 | +1. **Profile Comparison Visualization:** |
| 50 | +Fabric delivers side-by-side visual comparisons of key data properties (e.g., distributions, correlations, and outliers) between synthetic and original datasets, allowing users to assess fidelity at a glance. |
| 51 | + |
| 52 | +2. **PDF Report with Metrics:** |
| 53 | +Fabric generates a PDF report that includes key metrics to evaluate: |
| 54 | + |
| 55 | +- Fidelity: How closely synthetic data matches the original. |
| 56 | +- Utility: How well it performs in real-world tasks. |
| 57 | +- Privacy: Risk assessment of data leakage and re-identification. |
| 58 | + |
| 59 | +These tools ensure a thorough validation of synthetic data quality, making it reliable for real-world use. |
| 60 | + |
| 61 | +## Supported Generative AI Models |
| 62 | +With the upcoming update of `ydata-synthetic` to `ydata-sdk`, users will have access to a single API that automatically selects and optimizes |
| 63 | +the best generative model for their data. This streamlined approach eliminates the need to choose between |
| 64 | +various models manually, as the API intelligently identifies the optimal model based on the specific dataset and use case. |
| 65 | + |
| 66 | +Users no longer have to manually select from models such as: |
| 67 | + |
| 68 | +- [GAN](https://arxiv.org/abs/1406.2661) |
| 69 | +- [CGAN](https://arxiv.org/abs/1411.1784) (Conditional GAN) |
| 70 | +- [WGAN](https://arxiv.org/abs/1701.07875) (Wasserstein GAN) |
| 71 | +- [WGAN-GP](https://arxiv.org/abs/1704.00028) (Wasserstein GAN with Gradient Penalty) |
| 72 | +- [DRAGAN](https://arxiv.org/pdf/1705.07215.pdf) (Deep Regret Analytic GAN) |
| 73 | +- [Cramer GAN](https://arxiv.org/abs/1705.10743) (Cramer Distance Solution to Biased Wasserstein Gradients) |
| 74 | +- [CWGAN-GP](https://cameronfabbri.github.io/papers/conditionalWGAN.pdf) (Conditional Wasserstein GAN with Gradient Penalty) |
| 75 | +- [CTGAN](https://arxiv.org/pdf/1907.00503.pdf) (Conditional Tabular GAN) |
| 76 | +- [TimeGAN](https://papers.nips.cc/paper/2019/file/c9efe5f26cd17ba6216bbe2a7d26d490-Paper.pdf) (specifically for *time-series* data) |
| 77 | +- [DoppelGANger](https://dl.acm.org/doi/pdf/10.1145/3419394.3423643) (specifically for *time-series* data) |
| 78 | + |
| 79 | +The new API handles model selection automatically, optimizing for the best performance in fidelity, utility, and privacy. |
| 80 | +This significantly simplifies the synthetic data generation process, helping users obtain high-quality output without |
| 81 | +manual intervention or tedious hyperparameter tuning. |
| 82 | + |
| 83 | +<img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=dd69a9f9-0901-4cb4-9e56-b1e69877dca1" /> |