# Great Expectations

[Great Expectations](https://greatexpectations.io) is a Python-based open-source library for validating, documenting, and profiling your data. It helps you maintain data quality and improve communication about data between teams. With Great Expectations, you can assert what you expect from the data you load and transform, and catch data issues quickly – Expectations are essentially *unit tests for your data*.

## About Great Expectations
*Expectations* are assertions about your data. In Great Expectations, those assertions are expressed in a declarative language in the form of simple, human-readable Python methods. For example, to assert that the values in a column `passenger_count` in your dataset should be integers between 1 and 6, you can write:

```python
expect_column_values_to_be_between(column="passenger_count", min_value=1, max_value=6)
```

Great Expectations then uses this statement to validate whether the column `passenger_count` in a given table is indeed between 1 and 6, and returns a success or failure result. The library currently provides [several dozen highly expressive built-in Expectations](https://greatexpectations.io/expectations/), and allows you to write [custom Expectations](https://docs.greatexpectations.io/docs/guides/expectations/custom_expectations_lp/).
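
To build intuition for what such a statement checks, here is a plain-Python analogue of `expect_column_values_to_be_between` over a small, hypothetical list of values (Great Expectations itself does considerably more, e.g. batching, profiling, and reporting):

```python
# Plain-Python analogue of expect_column_values_to_be_between:
# flag any passenger_count values outside the 1..6 range.
passenger_count = [1, 2, 6, 3, 9]  # hypothetical sample values

unexpected = [v for v in passenger_count if not (1 <= v <= 6)]
result = {"success": not unexpected, "unexpected_values": unexpected}
print(result)  # {'success': False, 'unexpected_values': [9]}
```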

Great Expectations renders Expectations to clean, human-readable documentation called *Data Docs*. These HTML docs contain both your Expectation Suites and your data validation results each time validation is run – think of them as a continuously updated data quality report.

## Validating your Synthetic Data with Great Expectations

#### 1. Install the required libraries:
We recommend you create a virtual environment and install `ydata-synthetic` and `great-expectations` by running the following command in your terminal:

```bash
pip install ydata-synthetic great-expectations
```

#### 2. Generate your Synthetic Data:
In this example, we'll use CTGAN to synthesize samples from the [Adult Census Income](https://www.kaggle.com/datasets/uciml/adult-census-income?resource=download) dataset:

```python
from pmlb import fetch_data

from ydata_synthetic.synthesizers.regular import RegularSynthesizer
from ydata_synthetic.synthesizers import ModelParameters, TrainParameters

# Load the data and define the numerical and categorical columns
data = fetch_data('adult')
num_cols = ['age', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']
cat_cols = ['workclass', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex',
            'native-country', 'target']

# Define the training parameters
batch_size = 500
epochs = 501
learning_rate = 2e-4
beta_1 = 0.5
beta_2 = 0.9

ctgan_args = ModelParameters(batch_size=batch_size,
                             lr=learning_rate,
                             betas=(beta_1, beta_2))

train_args = TrainParameters(epochs=epochs)
synth = RegularSynthesizer(modelname='ctgan', model_parameters=ctgan_args)
synth.fit(data=data, train_arguments=train_args, num_cols=num_cols, cat_cols=cat_cols)

# Sample from the trained synthesizer and save the synthetic data
# (assumes a data/ directory exists)
synth_data = synth.sample(1000)
synth_data.to_csv('data/adult_synthetic.csv', index=False)
```
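
Before handing the sample to Great Expectations, a quick structural sanity check can catch obvious problems early. A minimal sketch using a stand-in DataFrame (in practice you would read the `data/adult_synthetic.csv` file written above; the values here are hypothetical):

```python
import pandas as pd

# Stand-in for the synthetic sample; in practice use
# pd.read_csv('data/adult_synthetic.csv')
synth_data = pd.DataFrame({
    'age': [39, 50, 28],
    'workclass': ['Private', 'Self-emp-not-inc', 'Private'],
    'hours-per-week': [40, 13, 40],
})

# The sample should be non-empty and contain the expected columns
expected_cols = {'age', 'workclass', 'hours-per-week'}
assert expected_cols.issubset(synth_data.columns), "missing columns"
assert len(synth_data) > 0, "empty sample"
print(synth_data.shape)  # (3, 3)
```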

#### 3. Create a Data Context and Connect to Data:
Import the `great_expectations` module, create a data context, and connect to your synthetic data:

```python
import great_expectations as gx

# Initialize data context
context = gx.get_context()

# Connect to the synthetic data
validator = context.sources.pandas_default.read_csv(
    "data/adult_synthetic.csv"
)
```

#### 4. Create Expectations:
You can create Expectation Suites by writing out individual statements such as the ones below, by using [Profilers and Data Assistants](https://docs.greatexpectations.io/docs/guides/expectations/profilers_data_assistants_lp), or even with [Custom Profilers](https://docs.greatexpectations.io/docs/guides/expectations/advanced/how_to_create_a_new_expectation_suite_using_rule_based_profilers/).

```python
# Create expectations: age must never be null, and the acceptable
# range for hours-per-week is estimated from the data via auto=True
validator.expect_column_values_to_not_be_null("age")
validator.expect_column_values_to_be_between("hours-per-week", auto=True)
validator.save_expectation_suite()
```
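
Each expectation call returns a validation result immediately, so you can see on the spot whether the synthetic column conforms. An abridged sketch of the shape of such a result (field names follow the validation result JSON; the values here are hypothetical):

```python
# Abridged sketch of an expectation's validation result
result = {
    "success": True,
    "result": {
        "element_count": 1000,   # rows checked
        "unexpected_count": 0,   # rows failing the expectation
        "unexpected_percent": 0.0,
    },
}
print(result["success"])  # True
```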

#### 5. Validate Data
To validate your data, define a checkpoint, which examines the data to determine whether it matches the defined Expectations:

```python
# Validate the synthetic data
checkpoint = context.add_or_update_checkpoint(
    name="synthetic_data_checkpoint",
    validator=validator,
)
```
Then run the checkpoint to obtain the validation results:

```python
checkpoint_result = checkpoint.run()
```
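
The returned result aggregates an overall pass/fail flag across all Expectations, which is handy for gating a data pipeline. A minimal sketch (the `gate` helper and the stand-in dictionary are hypothetical; in practice you would pass `checkpoint_result` itself):

```python
# Hypothetical pipeline gate on the overall validation outcome
def gate(checkpoint_result: dict) -> None:
    if not checkpoint_result["success"]:
        raise ValueError("Synthetic data failed validation")

gate({"success": True})  # passes silently
```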

Finally, use the following code to view an HTML representation of the validation results:

```python
context.view_validation_result(checkpoint_result)
```