| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "id": "4fe79b74", |
| 6 | + "metadata": {}, |
| 7 | + "source": [ |
| 8 | + "# From Expectations to Synthetic Data Generation\n",
| 9 | + "\n",
| 10 | + "## 2. Synthetic data generation\n"
| 11 | + ] |
| 12 | + }, |
| 13 | + { |
| 14 | + "cell_type": "code", |
| 15 | + "execution_count": 1, |
| 16 | + "id": "57c391ae", |
| 17 | + "metadata": {}, |
| 18 | + "outputs": [], |
| 19 | + "source": [ |
| 20 | + "import json\n",
| 21 | + "\n",
| 22 | + "import pandas as pd"
| 24 | + ] |
| 25 | + }, |
| 26 | + { |
| 27 | + "cell_type": "code", |
| 28 | + "execution_count": 2, |
| 29 | + "id": "21c70290", |
| 30 | + "metadata": {}, |
| 31 | + "outputs": [], |
| 32 | + "source": [ |
| 33 | + "dataset_name = \"Cardiovascular\"\n", |
| 34 | + "data = pd.read_csv('cardio.csv')\n", |
| 35 | + "\n", |
| 36 | + "#Load the profiling report saved in the previous notebook (stored as a JSON string)\n",
| 37 | + "with open(f'.profile_{dataset_name}.json') as f:\n",
| 38 | + "    json_profile = json.loads(json.load(f))"
| 39 | + ] |
| 40 | + }, |
| 41 | + { |
| 42 | + "cell_type": "markdown", |
| 43 | + "id": "f7a88682", |
| 44 | + "metadata": {}, |
| 45 | + "source": [ |
| 46 | + "Let's leverage the automated data type detection from `pandas-profiling` to separate the columns by type for the synthesis process."
| 47 | + ] |
| 48 | + }, |
| 49 | + { |
| 50 | + "cell_type": "code", |
| 51 | + "execution_count": 3, |
| 52 | + "id": "f3803d28", |
| 53 | + "metadata": {}, |
| 54 | + "outputs": [ |
| 55 | + { |
| 56 | + "name": "stdout", |
| 57 | + "output_type": "stream", |
| 58 | + "text": [ |
| 59 | + "Number of categorical: 6, Number of numerical: 5\n"
| 60 | + ] |
| 61 | + } |
| 62 | + ], |
| 63 | + "source": [ |
| 64 | + "num_cols = [col for col, val in json_profile['variables'].items() if val['type']=='Numeric' and col!='cardio']\n", |
| 65 | + "cat_cols = [col for col, val in json_profile['variables'].items() if val['type']=='Categorical' and col!='cardio'] \n", |
| 66 | + "\n", |
| 67 | + "print(f'Number of categorical: {len(cat_cols)}, Number of numerical: {len(num_cols)}')"
| 68 | + ] |
| 69 | + }, |
| 70 | + { |
| 71 | + "cell_type": "markdown", |
| 72 | + "id": "78af8502", |
| 73 | + "metadata": {}, |
| 74 | + "source": [ |
| 75 | + "### Prepare the data for synthesis\n", |
| 76 | + "\n", |
| 77 | + "After reviewing the warnings generated by `pandas-profiling`, we can see that the cardio dataset is generally well behaved, meaning we can rely on the standard data preparation performed by the synthesizer architectures:\n",
| 78 | + "- **Numerical columns** - Standard Scaler, which helps the models converge faster and makes results easier to reproduce.\n",
| 79 | + "- **Categorical columns** - Label Encoder\n",
| 80 | + "\n",
| 81 | + "Because we aim to generate synthetic data as close as possible to the original, it is recommended not to perform any outlier treatment or feature engineering.\n",
| 82 | + "\n", |
| 83 | + "<div class=\"alert alert-block alert-warning\">\n", |
| 84 | + "<b>Note:</b> The selection and use of data transformations for synthesis will vary based on the dataset and synthetic data generation approach.\n", |
| 85 | + "</div>" |
| 86 | + ] |
| 87 | + }, |
| 88 | + { |
| 89 | + "cell_type": "markdown", |
| 90 | + "id": "17acc52e", |
| 91 | + "metadata": {}, |
| 92 | + "source": [ |
| 93 | + "As we have selected a Conditional GAN architecture, we need to choose a conditional column. Typically, to optimize the utility and fidelity of the generated data, it is recommended to select the *target* variable, if the dataset has one.\n",
| 94 | + "\n",
| 95 | + "For the `Cardiovascular disease` dataset, we are going to use the variable *cardio* as our conditional column.\n",
| 96 | + "\n", |
| 97 | + "<p style=\"text-align:center;\"><img src=\"img/cgan.jpeg\" alt = \"test pic\" width=\"500\" height=\"200\"></p>\n", |
| 98 | + "\n", |
| 99 | + "[Image source](https://arxiv.org/abs/1411.1784)" |
| 100 | + ] |
| 101 | + }, |
| 102 | + { |
| 103 | + "cell_type": "code", |
| 104 | + "execution_count": 4, |
| 105 | + "id": "c851bc34", |
| 106 | + "metadata": {}, |
| 107 | + "outputs": [ |
| 108 | + { |
| 109 | + "data": { |
| 110 | + "text/plain": [ |
| 111 | + "0 34701\n", |
| 112 | + "1 33970\n", |
| 113 | + "Name: cardio, dtype: int64" |
| 114 | + ] |
| 115 | + }, |
| 116 | + "execution_count": 4, |
| 117 | + "metadata": {}, |
| 118 | + "output_type": "execute_result" |
| 119 | + } |
| 120 | + ], |
| 121 | + "source": [ |
| 122 | + "#The cardio dataset is well balanced with respect to the target variable.\n",
| 123 | + "#Other variables, such as gender, are also fairly balanced, even when conditioned on the target\n",
| 124 | + "data['cardio'].value_counts()" |
| 125 | + ] |
| 126 | + }, |
| 127 | + { |
| 128 | + "cell_type": "markdown", |
| 129 | + "id": "a45daedc", |
| 130 | + "metadata": {}, |
| 131 | + "source": [ |
| 132 | + "### Training a synthesizer" |
| 133 | + ] |
| 134 | + }, |
| 135 | + { |
| 136 | + "cell_type": "markdown", |
| 137 | + "id": "3aa4eee3", |
| 138 | + "metadata": {}, |
| 139 | + "source": [ |
| 140 | + "The synthesizer is configured through two objects: `ModelParameters`, which holds the architecture and optimization settings (`batch_size`, the learning rate, the Adam `betas`, the generator input `noise_dim` and the network `layers_dim`), and `TrainParameters`, which holds the training settings (the number of `epochs`, the `sample_interval` used for logging and the conditional `labels` to train on)."
| 141 | + ] |
| 142 | + }, |
| 143 | + { |
| 144 | + "cell_type": "code", |
| 145 | + "execution_count": 5, |
| 146 | + "id": "6d45dafc", |
| 147 | + "metadata": {}, |
| 148 | + "outputs": [], |
| 149 | + "source": [ |
| 150 | + "from ydata_synthetic.synthesizers.regular import RegularSynthesizer\n", |
| 151 | + "from ydata_synthetic.synthesizers import ModelParameters, TrainParameters" |
| 152 | + ] |
| 153 | + }, |
| 154 | + { |
| 155 | + "cell_type": "code", |
| 156 | + "execution_count": 6, |
| 157 | + "id": "c4917865", |
| 158 | + "metadata": {}, |
| 159 | + "outputs": [], |
| 160 | + "source": [ |
| 161 | + "## Setting the architecture hyperparameters\n", |
| 162 | + "noise_dim = 32\n", |
| 163 | + "dim = 128\n", |
| 164 | + "batch_size = 64\n", |
| 165 | + "\n", |
| 166 | + "#Adam beta values commonly used in the GAN literature\n",
| 167 | + "beta_1 = 0.5\n", |
| 168 | + "beta_2 = 0.9\n", |
| 169 | + "\n", |
| 170 | + "log_step = 100\n", |
| 171 | + "epochs = 5 + 1 #6 training epochs in total, kept low for demonstration purposes\n",
| 172 | + "learning_rate = 0.0001\n", |
| 173 | + "models_dir = '../cache'\n", |
| 174 | + "\n", |
| 175 | + "model_parameters = ModelParameters(batch_size=batch_size,\n", |
| 176 | + " lr=learning_rate,\n", |
| 177 | + " betas=(beta_1, beta_2),\n", |
| 178 | + " noise_dim=noise_dim,\n", |
| 179 | + " layers_dim=dim)\n", |
| 180 | + "\n", |
| 181 | + "train_args = TrainParameters(epochs=epochs,\n", |
| 182 | + " cache_prefix='',\n", |
| 183 | + " sample_interval=log_step,\n", |
| 184 | + " label_dim=-1,\n", |
| 185 | + " labels=(0,1))" |
| 186 | + ] |
| 187 | + }, |
| 188 | + { |
| 189 | + "cell_type": "code", |
| 190 | + "execution_count": 7, |
| 191 | + "id": "92e73ea7", |
| 192 | + "metadata": {}, |
| 193 | + "outputs": [ |
| 194 | + { |
| 195 | + "name": "stderr", |
| 196 | + "output_type": "stream", |
| 197 | + "text": [ |
| 198 | + "2022-08-17 08:54:12.781032: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", |
| 199 | + "2022-08-17 08:54:12.809180: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory\n", |
| 200 | + "2022-08-17 08:54:12.809200: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.\n", |
| 201 | + "Skipping registering GPU devices...\n", |
| 202 | + "2022-08-17 08:54:12.890704: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA\n", |
| 203 | + "To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.\n" |
| 204 | + ] |
| 205 | + }, |
| 206 | + { |
| 207 | + "name": "stdout", |
| 208 | + "output_type": "stream", |
| 209 | + "text": [ |
| 210 | + "Number of iterations per epoch: 1073\n" |
| 211 | + ] |
| 212 | + }, |
| 213 | + { |
| 214 | + "name": "stderr", |
| 215 | + "output_type": "stream", |
| 216 | + "text": [ |
| 217 | + " 17%|███████▌ | 1/6 [00:21<01:45, 21.06s/it]" |
| 218 | + ] |
| 219 | + }, |
| 220 | + { |
| 221 | + "name": "stdout", |
| 222 | + "output_type": "stream", |
| 223 | + "text": [ |
| 224 | + "Epoch: 0 | critic_loss: -0.2805415987968445 | gen_loss: 0.20160722732543945\n" |
| 225 | + ] |
| 226 | + }, |
| 227 | + { |
| 228 | + "name": "stderr", |
| 229 | + "output_type": "stream", |
| 230 | + "text": [ |
| 231 | + " 33%|███████████████ | 2/6 [00:41<01:23, 20.78s/it]" |
| 232 | + ] |
| 233 | + }, |
| 234 | + { |
| 235 | + "name": "stdout", |
| 236 | + "output_type": "stream", |
| 237 | + "text": [ |
| 238 | + "Epoch: 1 | critic_loss: -0.33020472526550293 | gen_loss: 0.31511062383651733\n" |
| 239 | + ] |
| 240 | + }, |
| 241 | + { |
| 242 | + "name": "stderr", |
| 243 | + "output_type": "stream", |
| 244 | + "text": [ |
| 245 | + " 50%|██████████████████████▌ | 3/6 [01:00<01:00, 20.03s/it]" |
| 246 | + ] |
| 247 | + }, |
| 248 | + { |
| 249 | + "name": "stdout", |
| 250 | + "output_type": "stream", |
| 251 | + "text": [ |
| 252 | + "Epoch: 2 | critic_loss: -0.22822757065296173 | gen_loss: 0.22219200432300568\n" |
| 253 | + ] |
| 254 | + }, |
| 255 | + { |
| 256 | + "name": "stderr", |
| 257 | + "output_type": "stream", |
| 258 | + "text": [ |
| 259 | + " 67%|██████████████████████████████ | 4/6 [01:19<00:39, 19.56s/it]" |
| 260 | + ] |
| 261 | + }, |
| 262 | + { |
| 263 | + "name": "stdout", |
| 264 | + "output_type": "stream", |
| 265 | + "text": [ |
| 266 | + "Epoch: 3 | critic_loss: -0.2783055007457733 | gen_loss: 0.28723031282424927\n" |
| 267 | + ] |
| 268 | + }, |
| 269 | + { |
| 270 | + "name": "stderr", |
| 271 | + "output_type": "stream", |
| 272 | + "text": [ |
| 273 | + " 83%|█████████████████████████████████████▌ | 5/6 [01:38<00:19, 19.25s/it]" |
| 274 | + ] |
| 275 | + }, |
| 276 | + { |
| 277 | + "name": "stdout", |
| 278 | + "output_type": "stream", |
| 279 | + "text": [ |
| 280 | + "Epoch: 4 | critic_loss: -0.22748863697052002 | gen_loss: 0.14494484663009644\n" |
| 281 | + ] |
| 282 | + }, |
| 283 | + { |
| 284 | + "name": "stderr", |
| 285 | + "output_type": "stream", |
| 286 | + "text": [ |
| 287 | + "100%|█████████████████████████████████████████████| 6/6 [01:57<00:00, 19.54s/it]" |
| 288 | + ] |
| 289 | + }, |
| 290 | + { |
| 291 | + "name": "stdout", |
| 292 | + "output_type": "stream", |
| 293 | + "text": [ |
| 294 | + "Epoch: 5 | critic_loss: -0.3456514775753021 | gen_loss: 0.4503900706768036\n" |
| 295 | + ] |
| 296 | + }, |
| 297 | + { |
| 298 | + "name": "stderr", |
| 299 | + "output_type": "stream", |
| 300 | + "text": [ |
| 301 | + "\n" |
| 302 | + ] |
| 303 | + } |
| 304 | + ], |
| 305 | + "source": [ |
| 306 | + "#Init the synthesizer model\n", |
| 307 | + "#n_critic sets the number of critic network updates per generator update\n",
| 308 | + "synth = RegularSynthesizer(modelname='cwgangp', model_parameters=model_parameters, n_critic=5)\n", |
| 309 | + "\n", |
| 310 | + "#Model training\n", |
| 311 | + "synth.fit(data=data, \n", |
| 312 | + " label_cols=[\"cardio\"], \n", |
| 313 | + " train_arguments=train_args,\n", |
| 314 | + " num_cols=num_cols, cat_cols=cat_cols)" |
| 315 | + ] |
| 316 | + }, |
| 317 | + { |
| 318 | + "cell_type": "code", |
| 319 | + "execution_count": 8, |
| 320 | + "id": "370b634a", |
| 321 | + "metadata": {}, |
| 322 | + "outputs": [], |
| 323 | + "source": [ |
| 324 | + "#Saving the trained synthesizer\n", |
| 325 | + "synth.save(f'{dataset_name}_synth.pkl')" |
| 326 | + ] |
| 327 | + }, |
| 328 | + { |
| 329 | + "cell_type": "code", |
| 330 | + "execution_count": 9, |
| 331 | + "id": "f019dced-cd65-46b4-ba12-ec58686ce95d", |
| 332 | + "metadata": {}, |
| 333 | + "outputs": [ |
| 334 | + { |
| 335 | + "name": "stderr", |
| 336 | + "output_type": "stream", |
| 337 | + "text": [ |
| 338 | + "2022-08-17 08:56:10.344535: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 140638208 exceeds 10% of free system memory.\n", |
| 339 | + "2022-08-17 08:56:10.404227: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 140638208 exceeds 10% of free system memory.\n", |
| 340 | + "2022-08-17 08:56:10.425273: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 140638208 exceeds 10% of free system memory.\n" |
| 341 | + ] |
| 342 | + } |
| 343 | + ], |
| 344 | + "source": [ |
| 345 | + "cond_array = data[[\"cardio\"]]\n", |
| 346 | + "\n", |
| 347 | + "#Generating a sample with the same size and conditional configuration as the original dataset\n", |
| 348 | + "synth_sample = synth.sample(cond_array)\n", |
| 349 | + "\n", |
| 350 | + "#Saving the synthetic sample as CSV\n", |
| 351 | + "synth_sample.to_csv(f'synth_{dataset_name}.csv', index=False)"
| 352 | + ] |
| 353 | + }, |
| 354 | + { |
| 355 | + "cell_type": "markdown", |
| 356 | + "id": "07ab99d1", |
| 357 | + "metadata": {}, |
| 358 | + "source": [ |
| 359 | + "## Summary & Next Steps" |
| 360 | + ] |
| 361 | + }, |
| 362 | + { |
| 363 | + "cell_type": "markdown", |
| 364 | + "id": "66e44026-ed63-4725-b3a9-60db888fae1f", |
| 365 | + "metadata": {}, |
| 366 | + "source": [ |
| 367 | + "Now that we have successfully generated our synthetic data sample, we need to assess whether the output of our synthesizer has enough quality. The quality of synthetic data can be translated into *Fidelity* and *Utility*.\n",
| 368 | + "\n",
| 369 | + "In the next notebook we will explore the *Fidelity* of our dataset through:\n",
| 370 | + "- Synthetic data profiling vs real data profiling\n",
| 371 | + "- Running the real data suite of expectations"
| 374 | + ] |
| 375 | + } |
| 376 | + ], |
| 377 | + "metadata": { |
| 378 | + "kernelspec": { |
| 379 | + "display_name": "synthetic", |
| 380 | + "language": "python", |
| 381 | + "name": "synthetic" |
| 382 | + }, |
| 383 | + "language_info": { |
| 384 | + "codemirror_mode": { |
| 385 | + "name": "ipython", |
| 386 | + "version": 3 |
| 387 | + }, |
| 388 | + "file_extension": ".py", |
| 389 | + "mimetype": "text/x-python", |
| 390 | + "name": "python", |
| 391 | + "nbconvert_exporter": "python", |
| 392 | + "pygments_lexer": "ipython3", |
| 393 | + "version": "3.8.13" |
| 394 | + } |
| 395 | + }, |
| 396 | + "nbformat": 4, |
| 397 | + "nbformat_minor": 5 |
| 398 | +} |