Commit d43b5f3

chore: Streamlit demo app to generate synthetic dataset using ydata-synthetic on tabular data (#166)
* Add files via upload
* Delete examples/regular/ydata-synthetic-streamlit directory
* Demo app with Streamlit
* Update app.py
* Update app.py
* Update app.py
* Add files via upload
* Create app.gif
* Create README.md
* Create requirements.txt
* Update README.md
* Update README.md
* Update requirements.txt
* Update README.md
* Update README.md
* Update README.md
1 parent f25ff47 commit d43b5f3

7 files changed

Lines changed: 162 additions & 0 deletions

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
[theme]
primaryColor="#040000"
backgroundColor="#770303"
secondaryBackgroundColor="#000000"
textColor="#f2f2f3"
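The four theme values above are hex colours, and how readable the app is depends on the contrast between them. The following is a minimal sketch, not part of the commit, that validates the values and computes the WCAG contrast ratio between pairs of them. The helper names (`parse_theme`, `relative_luminance`, `contrast_ratio`) are hypothetical, and the tiny line parser only handles the `key="value"` form shown here.

```python
import re

THEME = '''
[theme]
primaryColor="#040000"
backgroundColor="#770303"
secondaryBackgroundColor="#000000"
textColor="#f2f2f3"
'''

HEX_COLOR = re.compile(r"^#[0-9a-fA-F]{6}$")
KEY_VALUE = re.compile(r'^(\w+)\s*=\s*"([^"]*)"$')


def parse_theme(text):
    """Collect key="value" pairs; a deliberately tiny parser for this snippet only."""
    theme = {}
    for line in text.splitlines():
        match = KEY_VALUE.match(line.strip())
        if match:
            theme[match.group(1)] = match.group(2)
    return theme


def relative_luminance(hex_color):
    """WCAG 2.x relative luminance of an sRGB colour given as #rrggbb."""
    channels = [int(hex_color[i:i + 2], 16) / 255 for i in (1, 3, 5)]
    linear = [c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
              for c in channels]
    r, g, b = linear
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(color_a, color_b):
    """WCAG contrast ratio; the lighter colour always goes in the numerator."""
    lighter, darker = sorted(
        (relative_luminance(color_a), relative_luminance(color_b)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


theme = parse_theme(THEME)
invalid = [key for key, value in theme.items() if not HEX_COLOR.match(value)]
text_contrast = contrast_ratio(theme["textColor"], theme["backgroundColor"])
accent_contrast = contrast_ratio(theme["primaryColor"], theme["backgroundColor"])
print(invalid, round(text_contrast, 2), round(accent_contrast, 2))
```

A text-to-background ratio of at least 4.5 is the usual WCAG AA target for body text. The accent colour (`primaryColor`) is very close to black here, so its contrast against the dark red background is worth checking the same way.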
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
# Streamlit application to generate synthetic data using ydata-synthetic

<img src="https://github.com/rajeshai/ydata-synthetic/blob/dev/examples/regular/streamlit%20app/app.JPG" alt="streamlit app to generate synthetic data">

This application takes a pre-processed dataset as input and outputs a synthetic dataset based on the given input parameters. It is built with the open-source libraries Streamlit and ydata-synthetic and is deployed on Streamlit Cloud.

## How to use

1. Upload a pre-processed dataset.
2. Choose the numerical features and categorical features.
3. Choose all the training parameters appropriately.
4. Click the 'Click here to start the training process' button.

<img src="https://github.com/rajeshai/ydata-synthetic/blob/dev/examples/regular/streamlit%20app/app.gif" alt="streamlit app to generate synthetic data">

Wait for the training to end. After training, you will see a graph comparing the original data and the synthetic data.
Use a small number of epochs so that training completes quickly: this application is deployed on Streamlit's community cloud, which has limited compute.
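Steps 1 and 2 assume that the CSV has no missing values and that you already know which columns are numerical and which are categorical. The following is a minimal sketch, using only the standard library and a made-up three-column sample, of one way to check both before uploading. The helper names (`load_complete_rows`, `split_columns`) are hypothetical and are not part of the app.

```python
import csv
import io

# Made-up sample data; the second row has a missing value.
SAMPLE_CSV = """age,income,label
34,52000,yes
41,,no
29,48000,yes
"""


def load_complete_rows(text):
    """Read CSV text and drop rows with any empty cell (roughly what dropna does)."""
    reader = csv.DictReader(io.StringIO(text))
    return [row for row in reader
            if all(value not in (None, "") for value in row.values())]


def is_number(value):
    """True if the cell parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def split_columns(rows):
    """Treat a column as numerical only if every remaining cell parses as a number."""
    numerical, categorical = [], []
    for name in rows[0].keys():
        if all(is_number(row[name]) for row in rows):
            numerical.append(name)
        else:
            categorical.append(name)
    return numerical, categorical


rows = load_complete_rows(SAMPLE_CSV)
num_cols, cat_cols = split_columns(rows)
print(len(rows), num_cols, cat_cols)
```

Integer-coded categories (for example a 0/1 label) would be classified as numerical by this heuristic, so the final choice in step 2 still needs a human decision.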

## Contributing

The deployed application is available here: [![Open in Streamlit](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://share.streamlit.io/rajeshai/ydata-synthetic-streamlit/main/app.py)

Feel free to contribute to this app by adding more features and optimizing its performance further.
Lines changed: 15 additions & 0 deletions
108 KB
8.38 MB
Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
import os
import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from ydata_synthetic.synthesizers.regular import DRAGAN, CGAN, CRAMERGAN, WGAN_GP
from ydata_synthetic.synthesizers import ModelParameters, TrainParameters

st.set_page_config(layout="wide", initial_sidebar_state="auto")
os.environ['TF_XLA_FLAGS'] = '--tf_xla_enable_xla_devices'
def run():
12+
#global data_synn
13+
st.sidebar.image('YData_logo.svg')
14+
st.title('Generate synthetic data for a tabular classification dataset using [ydata-synthetic](https://github.com/ydataai/ydata-synthetic)')
15+
st.markdown('This streamlit application can generate synthetic data for your dataset. Please read all the instructions in the sidebar before you start the process.')
16+
data = st.file_uploader('Upload a preprocessed dataset in csv format')
17+
st.sidebar.title('About')
18+
st.sidebar.markdown('[ydata-synthetic](https://github.com/ydataai/ydata-synthetic) is an open-source library and is used to generate synthetic data mimicking the real world data.')
19+
st.sidebar.header('What is synthetic data?')
20+
st.sidebar.markdown('Synthetic data is artificially generated data that is not collected from real world events. It replicates the statistical components of real data without containing any identifiable information, ensuring individuals privacy.')
21+
st.sidebar.header('Why Synthetic Data?')
22+
st.sidebar.markdown('''Synthetic data can be used for many applications:
23+
- Privacy
24+
- Remove bias
25+
- Balance datasets
26+
- Augment datasets''')
27+
28+
29+
st.sidebar.header('Steps to follow')
30+
st.sidebar.markdown('''
31+
- Upload any preprocessed tabular classification dataset.
32+
- Choose the parameters in the adjacent window appropriately.
33+
- Since this is a demo, please choose less number of epochs for quick completion of training.
34+
- After choosing all parameters, Click the button under the parameters to start training.
35+
- After the training is complete, you will see a graph comparing both real data set and synthetic dataset. Categorical columns are used to compare.
36+
- You will also see a button to download your synthetic dataset. Click that button to download your dataset.''')
37+
38+
st.sidebar.markdown('''[![Repo](https://badgen.net/badge/icon/GitHub?icon=github&label)](https://github.com/ydataai/ydata-synthetic)''',unsafe_allow_html=True)
39+
    @st.cache
    def train(df):
        # Hyperparameters (batch_size, learning_rate, ...) come from the widgets defined in run().
        #models_dir = './cache'
        gan_args = ModelParameters(batch_size=batch_size,
                                   lr=learning_rate*0.001,
                                   betas=(beta_1, beta_2),
                                   noise_dim=noise_dim,
                                   layers_dim=layer_dim)

        train_args = TrainParameters(epochs=epochs,
                                     sample_interval=log_step)
        # n_discriminator=3 is passed to whichever class was selected;
        # check that each synthesizer's constructor accepts it.
        synthesizer = model(gan_args, n_discriminator=3)
        # Train on the DataFrame passed in rather than on the enclosing `data` variable.
        synthesizer.train(df, train_args, num_cols, cat_cols)
        synthesizer.save('data_synth.pkl')
        synthesizer = model.load('data_synth.pkl')
        data_syn = synthesizer.sample(samples)
        return data_syn
    @st.cache
    def convert_df(df):
        return df.to_csv().encode('utf-8')
    if data is not None:
        data = pd.read_csv(data)
        data.dropna(inplace=True)
        st.header('Choose the parameters!!')
        col1, col2, col3, col4 = st.columns(4)
        with col1:
            # Give each widget its own key.
            model = st.selectbox('Choose the GAN model', ['DRAGAN', 'CGAN', 'CRAMERGAN', 'WGAN_GP'], key='model')
            if model == 'DRAGAN':
                model = DRAGAN
            elif model == 'CGAN':
                model = CGAN
            elif model == 'CRAMERGAN':
                model = CRAMERGAN
            else:
                model = WGAN_GP
            num_cols = st.multiselect('Choose the numerical columns', data.columns, key='num_cols')
            cat_cols = st.multiselect('Choose categorical columns', [x for x in data.columns if x not in num_cols], key='cat_cols')

        with col2:
            noise_dim = st.number_input('Select noise dimension', 0, 200, 128, 1)
            layer_dim = st.number_input('Select the layer dimension', 0, 200, 128, 1)
            batch_size = st.number_input('Select batch size', 0, 500, 500, 1)

        with col3:
            log_step = st.number_input('Select sample interval', 0, 200, 100, 1)
            epochs = st.number_input('Select the number of epochs', 0, 50, 2, 1)
            learning_rate = st.number_input('Select learning rate (x1e-3)', 0.01, 0.1, 0.05, 0.01)

        with col4:
            beta_1 = st.slider('Select first beta coefficient', 0.0, 1.0, 0.5)
            beta_2 = st.slider('Select second beta coefficient', 0.0, 1.0, 0.9)
            samples = st.number_input('Select the number of synthetic samples to be generated', 0, 400000, step=1000)
    if st.button('Click here to start the training process'):
        if data is not None:
            st.write('Model training is in progress. It may take a few minutes. Please wait for a while.')
            data_synn = train(data)
            st.success('Synthetic dataset with the given number of samples is generated!!')
            st.subheader('Real Data vs Synthetic Data')
            # squeeze=False keeps `axes` two-dimensional even with a single categorical column.
            f, axes = plt.subplots(len(cat_cols), 2, figsize=(20, 25), squeeze=False)
            f.suptitle('Real data vs Synthetic data')
            for i, j in enumerate(cat_cols):
                sns.countplot(x=j, data=data, ax=axes[i, 0])
                sns.countplot(x=j, data=data_synn, ax=axes[i, 1])
            st.pyplot(f)
            st.download_button(
                label="Download data as CSV",
                data=convert_df(data_synn),
                file_name='data_syn.csv',
                mime='text/csv')
            st.balloons()
        else:
            st.write('Upload a dataset to train!!')
if __name__ == '__main__':
    run()
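One detail in `train` is easy to miss: the learning-rate widget value is multiplied by `0.001` before it reaches `ModelParameters`, so the 0.01 to 0.1 range offered in the UI corresponds to much smaller optimizer step sizes. The following is a small sketch of that arithmetic. The function name `effective_lr` is hypothetical and only illustrates the scaling.

```python
import math

LR_SCALE = 0.001  # same factor the app applies: lr = learning_rate * 0.001


def effective_lr(widget_value, scale=LR_SCALE):
    """Learning rate actually passed to the model for a given widget value."""
    return widget_value * scale


# Minimum, default, and maximum values of the learning-rate widget.
for widget_value in (0.01, 0.05, 0.1):
    print(f"widget={widget_value} -> lr={effective_lr(widget_value):.1e}")
```

With the widget default of 0.05, the model therefore trains with a learning rate of about 5e-5.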
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
pandas
matplotlib
numpy
seaborn
streamlit
ydata-synthetic
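None of the requirements above are pinned, so a fresh deployment may pick up newer releases than the ones the app was written against. The following is a hedged sketch, using only the standard library, that reports which versions are installed in the current environment and formats them as pins. The helper names (`installed_versions`, `as_pins`) are hypothetical, and no particular version numbers are implied.

```python
from importlib import metadata

REQUIREMENTS = ["pandas", "matplotlib", "numpy", "seaborn", "streamlit", "ydata-synthetic"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def as_pins(versions):
    """Format the installed packages as 'name==version' lines."""
    return [f"{name}=={version}" for name, version in versions.items()
            if version is not None]


versions = installed_versions(REQUIREMENTS)
print("\n".join(as_pins(versions)))
```

Running this in an environment where the app is known to work gives a list that can be pasted into the requirements file.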
