Commit 00724cc

Authored by: aquemy, alexbarros, ricardodcpereira, vascoalramos, miriamspsantos
docs: documentation for ydata synthetic (#247)
* docs: skeleton for the documentation
* docs: add example of automated classes
* docs: add requirements
* docs: complete reference links (#234)
* feat: add CTGAN model (#233)
* feat: add CTGAN model
* fix: change imports
* docs: add processors and synthesizers docs
* docs: add ctgan reference links
* docs: fix gan doc format
* docs: fix cgan docstrings
* docs: add/fix cramergan docstrings
* docs: add cwgangp docstrings
* docs: add dragan docstrings
* docs: add vanilla gan docstrings
* docs: add wgan docstrings
* docs: add wgan gp docstrings
---------
Co-authored-by: Ricardo Pereira <ricardo.dc.pereira@gmail.com>
* chore: add docs publishing ci
* chore: update docs ci
* chore: fix ci
* chore: fix docs ci
* chore: fix docs ci
* chore: remove doc branch from ci trigger
* chore: add dev branch trigger to docs ci
* chore: include dev as a PR ci trigger
* fix: docs pr ci
* fix(docs): replace old reference to base gan model to new one
* docs: add initial ydata-synthetic documentation (#274)
---------
Co-authored-by: Alex Barros <alexbarros@users.noreply.github.com>
Co-authored-by: Ricardo Pereira <ricardo.dc.pereira@gmail.com>
Co-authored-by: Vasco Ramos <vasco.ramos@ydata.ai>
Co-authored-by: Miriam Seoane Santos <68821478+miriamspsantos@users.noreply.github.com>
1 parent 93952ab commit 00724cc

38 files changed

Lines changed: 1015 additions & 93 deletions

.github/workflows/docs.yaml

Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
+name: Publish Documentation
+
+
+
+on:
+  push:
+    paths:
+      - .github/workflows/docs.yaml
+      - docs/**
+      - mkdocs.yml
+      - requirements-docs.txt
+    branches:
+      - main
+      - dev
+  release:
+    types:
+      - released
+      - prereleased
+
+
+
+jobs:
+  prepare:
+    name: Get Current version
+    runs-on: ubuntu-22.04
+
+    outputs:
+      version: ${{ steps.version.outputs.value }}
+
+    steps:
+      - uses: actions/checkout@v3
+        with:
+          token: ${{ secrets.ACCESS_TOKEN }}
+
+      - name: Find Latest Tag
+        id: latest_tag
+        uses: oprypin/find-latest-tag@v1.1.1
+        with:
+          repository: ${{ github.repository }}
+          regex: '^\d+\.\d+\.\d+$'
+
+      - name: Extract major and minor version
+        id: version
+        run: |
+          echo "value=`echo ${{ steps.latest_tag.outputs.tag }} | sed -r 's|([0-9]+.[0-9]+).*|\1|g'`" >> $GITHUB_OUTPUT
+
+
+  publish-docs:
+    name: Publish Docs
+    runs-on: ubuntu-22.04
+
+    needs:
+      - prepare
+
+    steps:
+      - uses: actions/checkout@v3
+        with:
+          fetch-depth: 0
+          token: ${{ secrets.ACCESS_TOKEN }}
+
+      - name: Configurating Git
+        run: |
+          git config user.email "azory@ydata.ai"
+          git config user.name "Azory YData Bot"
+          git config core.autocrlf false
+
+      - name: Setup Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: "3.10"
+
+      - name: Cache pip dependencies
+        id: cache
+        uses: actions/cache@v3
+        with:
+          path: ~/.cache/pip
+          key: ${{ runner.os }}-pip-${{ hashFiles('pyproject.toml') }}
+
+      - name: Install doc dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -r requirements-docs.txt
+
+      - name: Publish
+        run: make publish-docs version=${{ needs.prepare.outputs.version }}
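
The `prepare` job in this workflow reduces the latest `MAJOR.MINOR.PATCH` tag to `MAJOR.MINOR` before handing it to `make publish-docs`. A minimal Python sketch of that transformation follows; `major_minor` is a hypothetical helper that is not part of the commit, and unlike the `sed` expression it escapes the dot and rejects tags that do not match the tag filter.

```python
import re

# Same shape as the tag filter used in the workflow: MAJOR.MINOR.PATCH
TAG_PATTERN = re.compile(r"^(\d+\.\d+)\.\d+$")


def major_minor(tag: str) -> str:
    """Return 'MAJOR.MINOR' for a 'MAJOR.MINOR.PATCH' tag (hypothetical helper)."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unexpected tag format: {tag!r}")
    return match.group(1)


print(major_minor("1.3.1"))
```

Raising on an unexpected tag is a design choice of this sketch; the shell step in the workflow would pass a non-matching string through unchanged.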

.github/workflows/pull_request.yml

Lines changed: 33 additions & 1 deletion
@@ -5,7 +5,9 @@ on:
     branches:
       - renovate/**
   pull_request:
-    branches: [ master ]
+    branches:
+      - master
+      - dev

 jobs:
   validate:
@@ -41,3 +43,33 @@ jobs:

       - name: Tests
         run: make test || exit 0
+
+  validate-docs:
+    name: Validate Docs
+    runs-on: ubuntu-22.04
+
+    steps:
+      - uses: actions/checkout@v3
+
+      - name: Setup Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: "3.10"
+
+      - name: Cache pip dependencies
+        id: cache
+        uses: actions/cache@v3
+        with:
+          path: ~/.cache/pip
+          key: ${{ runner.os }}-pip-${{ hashFiles('requirements-docs.txt') }}
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -r requirements-docs.txt
+
+      - name: Build docs
+        run: |
+          echo "0.0dev0" > VERSION
+          pip install .
+          mkdocs build

.gitignore

Lines changed: 2 additions & 1 deletion
@@ -255,6 +255,7 @@ pythonenv*

 # mkdocs documentation
 /site
+/static/docs

 # mypy
 .mypy_cache/
@@ -373,4 +374,4 @@ DerivedData/

 # User created
 VERSION
-version.py
+version.py

Makefile

Lines changed: 4 additions & 2 deletions
@@ -31,5 +31,7 @@ clean: ### Removes build binaries
 install: ### Installs required dependencies
 	$(PIP) install dist/ydata-synthetic-$(version).tar.gz

-
-
+publish-docs: ### Publishes the documentation
+	echo "$(version)" > VERSION
+	$(PIP) install .
+	mike deploy --push --update-aliases $(version) latest

docs/README.md

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+# ydata-synthetic documentation
+
+Installing the doc dependencies (one time step):
+```
+pip install -r requirements-docs.txt
+```
+
+Build the doc for deployment:
+```
+mkdocs build
+```
+
+To build and serve locally:
+```
+mkdocs serve
+```

docs/getting-started/examples.md

Whitespace-only changes.
Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+
+`ydata-synthetic` is available through PyPI, making it easy to install and integrate with common data science environments (Google Colab, Jupyter Notebooks, Visual Studio Code, PyCharm) and the data science stack (`pandas`, `numpy`, `scikit-learn`).
+
+## Installing the package
+Currently, the package supports **Python 3.9 and 3.10** (`>=3.9, <3.11`) and can be installed on Windows, Linux, or macOS.
+
+Before installing the package, we recommend creating a virtual or `conda` environment:
+
+=== "conda"
+    ``` commandline
+    conda create -n synth-env python=3.10
+    conda activate synth-env
+    ```
+
+The above commands create and activate a new environment called "synth-env" with Python version 3.10.X. In the new environment, you can then install `ydata-synthetic`:
+
+=== "pypi"
+    ``` commandline
+    pip install ydata-synthetic==1.1.0
+    ```
+
+:fontawesome-brands-youtube:{ style="color: #EE0F0F" }
+[Installing ydata-synthetic](https://www.youtube.com/watch?v=aESmGcxtBdU) – :octicons-clock-24:
+5min – Step-by-step installation guide
+
+## Using Google Colab
+To install inside a Google Colab notebook, you can use the following:
+
+``` commandline
+!pip install ydata-synthetic==1.1.0
+```
+
+Make sure your Google Colab is running Python versions `>=3.9, <3.11`. Learn how to configure Python versions on Google Colab [here](https://stackoverflow.com/questions/68657341/how-can-i-update-google-colabs-python-version/68658479#68658479).
+
+
+## Installing the Streamlit App
+Since version 1.0.0, `ydata-synthetic` includes a GUI provided by a Streamlit app. The UI supports the data synthesization process from reading the data to profiling the generated synthetic data, and can be installed as follows:
+
+``` commandline
+pip install "ydata-synthetic[streamlit]"
+```
+
+Note that the app is not yet supported in Jupyter or Colab notebooks, so run it from a regular Python environment.
+
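
The installation page in this hunk asks for a Python version in the range `>=3.9, <3.11`, which is easy to get wrong on a hosted runtime such as Colab. A minimal sketch of a check that can be run before `pip install` is shown below; `is_supported` and the two constants are hypothetical names introduced here for illustration and are not part of the package.

```python
import sys

# Supported range taken from the installation notes: >=3.9, <3.11
MIN_VERSION = (3, 9)
MAX_VERSION_EXCLUSIVE = (3, 11)


def is_supported(version_info=None) -> bool:
    """Check whether a (major, minor, ...) version falls in the supported range."""
    if version_info is None:
        version_info = sys.version_info
    major_minor = tuple(version_info[:2])
    return MIN_VERSION <= major_minor < MAX_VERSION_EXCLUSIVE


if not is_supported():
    current = f"{sys.version_info.major}.{sys.version_info.minor}"
    print(f"Python {current} is outside the supported range >=3.9, <3.11")
```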

docs/getting-started/quickstart.md

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
+# Quickstart
+
+`ydata-synthetic` is equipped to handle both **tabular** (comprising numeric and categorical features) and sequential, **time-series** data. In this section we explain how you can **quickstart the synthesization** of tabular and time-series datasets.
+
+## Synthesizing a Tabular Dataset
+The following example showcases how to synthesize the [Adult Census Income](https://www.kaggle.com/datasets/uciml/adult-census-income) dataset with CTGAN:
+=== "Tabular Data"
+    ```python
+    # Import the necessary modules
+    from pmlb import fetch_data
+    from ydata_synthetic.synthesizers.regular import RegularSynthesizer
+    from ydata_synthetic.synthesizers import ModelParameters, TrainParameters
+
+    # Load data
+    data = fetch_data('adult')
+    num_cols = ['age', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']
+    cat_cols = ['workclass','education', 'education-num', 'marital-status',
+                'occupation', 'relationship', 'race', 'sex', 'native-country', 'target']
+
+    # Define model and training parameters
+    ctgan_args = ModelParameters(batch_size=500, lr=2e-4, betas=(0.5, 0.9))
+    train_args = TrainParameters(epochs=501)
+
+    # Train the generator model
+    synth = RegularSynthesizer(modelname='ctgan', model_parameters=ctgan_args)
+    synth.fit(data=data, train_arguments=train_args, num_cols=num_cols, cat_cols=cat_cols)
+
+    # Generate 1000 new synthetic samples
+    synth_data = synth.sample(1000)
+    ```
+
+## Synthesizing a Time-Series Dataset
+The following example showcases how to synthesize the [Yahoo Stock Price](https://www.kaggle.com/datasets/arashnic/time-series-forecasting-with-yahoo-stock-price) dataset with TimeGAN:
+=== "Time-Series Data"
+    ```python
+    # Import the necessary modules
+    import pandas as pd
+    from ydata_synthetic.synthesizers import ModelParameters
+    from ydata_synthetic.synthesizers.timeseries import TimeGAN
+    from ydata_synthetic.preprocessing.timeseries.utils import real_data_loading
+
+    # Load and preprocess data
+    stock_data_df = pd.read_csv("stock_data.csv")
+    processed_data = real_data_loading(stock_data_df.values, seq_len=24)
+
+    # Define model and training parameters
+    gan_args = ModelParameters(batch_size=128, lr=5e-4, noise_dim=128, layers_dim=128)
+    synth = TimeGAN(model_parameters=gan_args, hidden_dim=24, seq_len=24, n_seq=6, gamma=1)
+
+    # Train the generator model
+    synth.train(data=processed_data, train_steps=50000)
+
+    # Generate new synthetic data
+    synth_data = synth.sample(len(stock_data_df))
+    ```
+
+## Running the Streamlit App
+Once the package is [installed](installation.md) with the "streamlit" extra, the app can be launched as:
+
+=== "Streamlit App"
+    ```python
+    from ydata_synthetic import streamlit_app
+
+    streamlit_app.run()
+    ```
+
+The console will then output the URL from which the app can be accessed.
+
+:fontawesome-brands-youtube:{ style="color: #EE0F0F" } Here's a [quick example](https://www.youtube.com/watch?v=6Lzi26szKNo&t=4s) of how to synthesize data with the Streamlit App – :octicons-clock-24: 5min
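
The time-series example in this hunk passes `seq_len=24` to both the preprocessing step and `TimeGAN`, which suggests the model trains on fixed-length sequences rather than on individual rows. The sketch below illustrates that idea with plain Python lists, assuming overlapping sliding windows. `sliding_windows` is a hypothetical helper written for this note; it is not the library's `real_data_loading`, which may also scale or shuffle the data and may count windows differently.

```python
from typing import List, Sequence


def sliding_windows(rows: Sequence[Sequence[float]], seq_len: int) -> List[List[Sequence[float]]]:
    """Cut a series of rows into overlapping windows of length seq_len."""
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    # One window per valid start index; empty if the series is shorter than seq_len
    return [list(rows[i:i + seq_len]) for i in range(len(rows) - seq_len + 1)]


# Toy series: 30 time steps with 2 features each
series = [(float(t), float(t) * 2.0) for t in range(30)]
windows = sliding_windows(series, seq_len=24)
print(len(windows), len(windows[0]), len(windows[0][0]))
```

Each window has shape `(seq_len, n_features)`, which is the layout a sequence model consumes; in the quickstart example the feature count corresponds to `n_seq=6`.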

docs/index.md

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
+<p></p>
+<p align="center"><img width="250" src="https://user-images.githubusercontent.com/3348134/177604157-11181f6c-57e5-44b1-8f6c-774edbba5512.png" alt="YData Logo"></p>
+<p></p>
+
+[![pypi](https://img.shields.io/pypi/v/ydata-synthetic)](https://pypi.org/project/ydata-synthetic)
+![Pythonversion](https://img.shields.io/badge/python-3.9%20%7C%203.10-blue)
+[![downloads](https://static.pepy.tech/badge/ydata-synthetic/month)](https://pepy.tech/project/ydata-synthetic)
+![](https://img.shields.io/github/license/ydataai/ydata-synthetic)
+![](https://img.shields.io/pypi/status/ydata-synthetic)
+[![Build Status](https://github.com/ydataai/ydata-synthetic/actions/workflows/tests.yml/badge.svg?branch=master)](https://github.com/ydataai/ydata-synthetic/actions/workflows/tests.yml)
+[![Code Coverage](https://codecov.io/gh/ydataai/ydata-synthetic/branch/master/graph/badge.svg?token=gMptB4YUnF)](https://codecov.io/gh/ydataai/ydata-synthetic)
+[![GitHub stars](https://img.shields.io/github/stars/ydataai/ydata-synthetic?style=social)](https://github.com/ydataai/ydata-synthetic)
+[![Discord](https://img.shields.io/discord/1037720091376238592?label=Discord&logo=Discord)](https://discord.com/invite/mw7xjJ7b7s)
+
+
+
+## Overview
+`ydata-synthetic` is the go-to Python package for **synthetic data generation for tabular and time-series data**. It uses the latest Generative AI models to learn the properties of real data and create realistic synthetic data. This project was created to educate the community about synthetic data and its applications in real-world domains, such as data augmentation, bias mitigation, data sharing, and privacy engineering. To learn more about Synthetic Data and its applications, [check this article](https://ydata.ai/resources/10-most-frequently-asked-questions-about-synthetic-data).
+
+## Current Functionality
+- 🤖 **Create Realistic Synthetic Data using Generative AI Models:** `ydata-synthetic` supports state-of-the-art generative adversarial networks for data generation, namely Vanilla GAN, CGAN, WGAN, WGAN-GP, DRAGAN, Cramer GAN, CWGAN-GP, CTGAN, and TimeGAN. Learn more about the use of [GANs for Synthetic Data generation](https://medium.com/ydata-ai/generating-synthetic-tabular-data-with-gans-part-1-866705a77302).
+
+- 📀 **Synthetic Data Generation for Tabular and Time-Series Data:** The package supports the synthesization of tabular and time-series data, covering a wide range of real-world applications. Learn how to leverage `ydata-synthetic` for [tabular](https://ydata.ai/resources/gans-for-synthetic-data-generation) and [time-series](https://towardsdatascience.com/synthetic-time-series-data-a-gan-approach-869a984f2239) data.
+
+- 💻 **Best Generation Experience in Open Source:** Includes a guided UI experience for the generation of synthetic data, from reading the data to visualizing the synthetic data, all served by a slick Streamlit app.
+    :fontawesome-brands-youtube:{ style="color: #EE0F0F" } Here's a [quick overview](https://www.youtube.com/watch?v=ep0PhwsFx0A) – :octicons-clock-24: 1min
+
+
+## Supported Data Types
+
+=== "Tabular Data"
+    **Tabular data** does not have a temporal dependence, and can be structured and organized in a table-like format, where **features are represented in columns**, whereas **observations correspond to the rows**.
+
+    Additionally, tabular data usually comprises both *numeric* and *categorical* features. **Numeric** features are those that encode **quantitative** values, whereas **categorical** features represent **qualitative** measurements. Categorical features can be further divided into *ordinal*, *binary* or *boolean*, and *nominal* features.
+
+    Learn more about synthesizing tabular data in this [article](https://ydata.ai/resources/gans-for-synthetic-data-generation), or check the [quickstart guide](getting-started/quickstart.md#synthesizing-a-tabular-dataset) to get started with the synthesization of tabular datasets.
+
+=== "Time-Series Data"
+    **Time-series data** exhibit a sequential, **temporal dependency** between records, and may present a wide range of patterns and trends, including **seasonality** (patterns that repeat at calendar periods -- days, weeks, months -- such as holiday sales, for instance) or **periodicity** (patterns that repeat over time).
+
+    Read more about generating time-series data in this [article](https://ydata.ai/resources/synthetic-time-series-data-a-gan-approach) and check this [quickstart guide](getting-started/quickstart.md#synthesizing-a-time-series-dataset) to get started with time-series data synthesization.
+
+
+## Supported Generative AI Models
+The following architectures are currently supported:
+
+- [GAN](https://arxiv.org/abs/1406.2661)
+- [CGAN](https://arxiv.org/abs/1411.1784) (Conditional GAN)
+- [WGAN](https://arxiv.org/abs/1701.07875) (Wasserstein GAN)
+- [WGAN-GP](https://arxiv.org/abs/1704.00028) (Wasserstein GAN with Gradient Penalty)
+- [DRAGAN](https://arxiv.org/pdf/1705.07215.pdf) (Deep Regret Analytic GAN)
+- [Cramer GAN](https://arxiv.org/abs/1705.10743) (Cramer Distance Solution to Biased Wasserstein Gradients)
+- [CWGAN-GP](https://cameronfabbri.github.io/papers/conditionalWGAN.pdf) (Conditional Wasserstein GAN with Gradient Penalty)
+- [CTGAN](https://arxiv.org/pdf/1907.00503.pdf) (Conditional Tabular GAN)
+- [TimeGAN](https://papers.nips.cc/paper/2019/file/c9efe5f26cd17ba6216bbe2a7d26d490-Paper.pdf) (specifically for *time-series* data)

docs/reference/api/index.md

Whitespace-only changes.
