Commit 387cf77
docs: add integration example with great expectations (#285)
1 parent eb1b574 commit 387cf77

2 files changed

Lines changed: 109 additions & 0 deletions

Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
# Great Expectations

[Great Expectations](https://greatexpectations.io) is a Python-based open-source library for validating, documenting, and profiling your data. It helps you maintain data quality and improve communication about data between teams. With Great Expectations, you can assert what you expect from the data you load and transform, and catch data issues quickly: Expectations are essentially *unit tests for your data*.

## About Great Expectations

*Expectations* are assertions about your data. In Great Expectations, those assertions are expressed in a declarative language in the form of simple, human-readable Python methods. For example, to assert that the values in the `passenger_count` column of your dataset fall between 1 and 6 (inclusive), you can say:

```python
expect_column_values_to_be_between(column="passenger_count", min_value=1, max_value=6)
```

Great Expectations then uses this statement to validate whether the values in the `passenger_count` column of a given table are indeed between 1 and 6, and returns a success or failure result. The library currently provides [several dozen highly expressive built-in Expectations](https://greatexpectations.io/expectations/), and allows you to write [custom Expectations](https://docs.greatexpectations.io/docs/guides/expectations/custom_expectations_lp/).
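To make the "success or failure result" concrete, here is a minimal plain-Python sketch of the idea behind a range Expectation. It is not Great Expectations code: the function `check_values_between`, its return keys, and the choice to skip missing values are all assumptions made for illustration.

```python
from typing import Any, Dict, Iterable


def check_values_between(values: Iterable[Any], min_value: float, max_value: float) -> Dict[str, Any]:
    """Hypothetical stand-in for a range check; not part of Great Expectations."""
    # Assumption of this sketch: missing values (None) are skipped, not counted as failures.
    present = [v for v in values if v is not None]
    # Bounds are treated as inclusive on both ends.
    unexpected = [v for v in present if not (min_value <= v <= max_value)]
    return {
        "success": not unexpected,
        "unexpected_count": len(unexpected),
        "unexpected_values": unexpected,
    }


# Toy column with two out-of-range values (0 and 7) and one missing value.
passenger_count = [1, 2, 6, 0, 7, None, 3]
result = check_values_between(passenger_count, min_value=1, max_value=6)
print(result)
```

The check fails as a whole as soon as one value falls outside the range, and the offending values are reported alongside the verdict so they can be inspected.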

Great Expectations renders Expectations to clean, human-readable documentation called *Data Docs*. These HTML docs contain both your Expectation Suites and your data validation results each time validation is run. Think of them as a continuously updated data quality report.

## Validating your Synthetic Data with Great Expectations

#### 1. Install the required libraries:
We recommend creating a virtual environment and installing `ydata-synthetic` and `great-expectations` by running the following command in your terminal:

```bash
pip install ydata-synthetic great-expectations
```

#### 2. Generate your Synthetic Data:
In this example, we'll use CTGAN to synthesize samples from the [Adult Census Income](https://www.kaggle.com/datasets/uciml/adult-census-income?resource=download) dataset:

```python
import os

from pmlb import fetch_data

from ydata_synthetic.synthesizers.regular import RegularSynthesizer
from ydata_synthetic.synthesizers import ModelParameters, TrainParameters

# Load data and define the data processor parameters
data = fetch_data('adult')
num_cols = ['age', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']
cat_cols = ['workclass', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex',
            'native-country', 'target']

# Define the training parameters
batch_size = 500
epochs = 500 + 1
learning_rate = 2e-4
beta_1 = 0.5
beta_2 = 0.9

ctgan_args = ModelParameters(batch_size=batch_size,
                             lr=learning_rate,
                             betas=(beta_1, beta_2))

train_args = TrainParameters(epochs=epochs)
synth = RegularSynthesizer(modelname='ctgan', model_parameters=ctgan_args)
synth.fit(data=data, train_arguments=train_args, num_cols=num_cols, cat_cols=cat_cols)

# Sample from the trained synthesizer and save the synthetic data
synth_data = synth.sample(1000)
os.makedirs('data', exist_ok=True)  # to_csv does not create missing directories
synth_data.to_csv('data/adult_synthetic.csv', index=False)
```

#### 3. Create a Data Context and Connect to Data:
Import the `great_expectations` module, create a data context, and connect to your synthetic data:

```python
import great_expectations as gx

# Initialize data context
context = gx.get_context()

# Connect to the synthetic data
validator = context.sources.pandas_default.read_csv(
    "data/adult_synthetic.csv"
)
```

#### 4. Create Expectations:
You can create Expectation Suites by writing out individual statements, such as the ones below, by using [Profilers and Data Assistants](https://docs.greatexpectations.io/docs/guides/expectations/profilers_data_assistants_lp), or even with [Custom Profilers](https://docs.greatexpectations.io/docs/guides/expectations/advanced/how_to_create_a_new_expectation_suite_using_rule_based_profilers/).

```python
# Create expectations
validator.expect_column_values_to_not_be_null("age")
validator.expect_column_values_to_be_between("workclass", auto=True)
validator.save_expectation_suite()
```
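When validating synthetic data, one option is to take the expected range from the real data instead of estimating it. The sketch below shows that idea in plain Python; `column_bounds` and the toy `real_age` list are hypothetical names introduced for illustration, and the commented-out call assumes the `validator` from step 3.

```python
from typing import Iterable, Optional, Tuple


def column_bounds(values: Iterable[Optional[float]]) -> Tuple[float, float]:
    """Return the (min, max) of the non-missing values in a real-data column."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("cannot derive bounds from an empty column")
    return min(present), max(present)


# Toy stand-in for the real `age` column; in practice this could come from data["age"].
real_age = [17, 25, 38, 52, 90, None]
min_age, max_age = column_bounds(real_age)
print(min_age, max_age)

# With the validator from step 3, the bounds could then be passed explicitly:
# validator.expect_column_values_to_be_between("age", min_value=min_age, max_value=max_age)
```

Explicit bounds make the Expectation state what the real data looks like, so a synthetic sample that drifts outside the observed range fails validation.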

#### 5. Validate Data:
To validate your data, define a checkpoint that checks the data against the Expectations you defined:

```python
# Define a checkpoint for the synthetic data
checkpoint = context.add_or_update_checkpoint(
    name="synthetic_data_checkpoint",
    validator=validator,
)
```

Then run the checkpoint to produce the validation results:

```python
checkpoint_result = checkpoint.run()
```

And use the following code to view an HTML representation of the validation results:

```python
context.view_validation_result(checkpoint_result)
```
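Besides viewing the report, the result can also gate a pipeline so that synthetic data failing validation is not passed downstream. The following is a minimal sketch under one assumption: that the object returned by `checkpoint.run()` exposes a boolean `success` attribute. The helper `require_valid` is hypothetical, and `SimpleNamespace` objects stand in for real results so the snippet runs on its own.

```python
from types import SimpleNamespace


def require_valid(result) -> None:
    """Raise if a validation result reports failure.

    Assumes `result` has a boolean `success` attribute (an assumption
    about the checkpoint result object, not verified here).
    """
    if not result.success:
        raise ValueError("synthetic data failed validation")


# Stand-ins for real checkpoint results, used only to exercise the helper.
passing = SimpleNamespace(success=True)
failing = SimpleNamespace(success=False)

require_valid(passing)  # returns silently
try:
    require_valid(failing)
except ValueError as err:
    print(f"blocked: {err}")
```

In a real run, `require_valid(checkpoint_result)` would be called right after `checkpoint.run()`.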

mkdocs.yml

Lines changed: 2 additions & 0 deletions
@@ -21,6 +21,8 @@ nav:
 - CWGAN-GP: "examples/cwgangp_example.md"
 - Generate Time-Series Data:
 - TimeGAN: "examples/timegan_example.md"
+- Integrations:
+- Great Expectations: "integrations/gx_integration.md"
 - Support:
 - Help & Troubleshooting: 'support/help-troubleshooting.md'
 - Contribution Guidelines: 'support/contribute.md'
