# Validate synthetic data via Train-Synthetic-Test-Real <a href="https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/train-synthetic-test-real/TSTR.ipynb" target="_blank"><img src="https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab" alt="Run on Colab"></a>

In this tutorial, we demonstrate the process of evaluating the quality of synthetic data based on its utility for a downstream Machine Learning (ML) task. The method is commonly referred to as the Train-Synthetic-Test-Real (TSTR) evaluation [[1](#refs)]. The TSTR evaluation serves as a robust measure of synthetic data quality because ML models rely on the accurate representation of deeper underlying patterns to perform effectively on previously unseen data. As a result, this approach offers a more reliable assessment than simply evaluating higher-level statistics.

The image below shows the general setup of TSTR.

<img src='https://raw.githubusercontent.com/mostly-ai/mostly-tutorials/dev/train-synthetic-test-real/TSTR.png' width="600px"/>

We take actual (i.e., real) data and split it into a training dataset and a holdout dataset. Next, we create a synthetic dataset based only on the training data. Then we train a Machine Learning (ML) model twice: once on the synthetic data and once on the actual training data. Finally, we evaluate the performance of each of those two models on the actual holdout data, which was kept aside all along. By comparing the performance of these two models, we can assess how much utility the synthesization method has retained with respect to a specific ML task.

Note that one needs to use a true holdout for the evaluation to properly measure out-of-sample performance, as this is the relevant metric for real-world use cases. If one evaluated on the same training data that was used for the synthesis, one would "leak" information from training into evaluation. This becomes a particular issue for synthesizers that are prone to overfitting and simply memorize the samples they have been exposed to. If one, on the other hand, were to use synthetic data for the evaluation, one would not get meaningful results either, as the synthetic data might not be representative of the real data. For example, consider the degenerate case of a synthesizer that produces the same record over and over again. Any model trained on that data would yield perfect results when evaluated on it again, whereas it would be of no use when applied to real data.
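The degenerate case can be made concrete with a minimal sketch. It assumes a toy "model" that always predicts the most frequent label it saw during training, and the label lists are hypothetical values chosen for illustration, not taken from the census data.

```python
from collections import Counter


def fit_majority_class(labels):
    """Toy stand-in for an ML model: always predict the most frequent training label."""
    return Counter(labels).most_common(1)[0][0]


def accuracy_of_constant(prediction, labels):
    """Share of records for which the constant prediction is correct."""
    return sum(1 for y in labels if y == prediction) / len(labels)


# hypothetical output of a degenerate synthesizer: the same record, 1,000 times
degenerate_syn_labels = [">50K"] * 1000
# hypothetical real holdout with 25% high-income records
real_holdout_labels = [">50K"] * 25 + ["<=50K"] * 75

constant_pred = fit_majority_class(degenerate_syn_labels)
acc_on_synthetic = accuracy_of_constant(constant_pred, degenerate_syn_labels)
acc_on_real = accuracy_of_constant(constant_pred, real_holdout_labels)

print(f"accuracy when evaluated on synthetic data: {acc_on_synthetic:.0%}")
print(f"accuracy when evaluated on real holdout:   {acc_on_real:.0%}")
```

The perfect score on the synthetic data says nothing about the model's usefulness; only the evaluation on the real holdout reveals that the model has learned nothing.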

## Synthesize Data via MOSTLY AI

For this tutorial, we use a cleaned-up version of the UCI Adult Income [[2](#refs)] dataset, which itself stems from the 1994 American Community Survey [[3](#refs)] by the US Census Bureau. The dataset consists of 48,842 records with 14 mixed-type features and 1 target variable that indicates whether or not a respondent reported a high level of annual income. We selected this dataset because it is one of the go-to datasets for showcasing machine learning models in action.

In [None]:
%pip install -U mostlyai  # or: pip install -U 'mostlyai[local]'
%pip install scikit-learn seaborn lightgbm

In [None]:
import pandas as pd

# fetch original data
df = pd.read_csv("https://github.com/mostly-ai/public-demo-data/raw/dev/census/census.csv.gz")
df

In [None]:
from sklearn.model_selection import train_test_split

# split into training and holdout
df_trn, df_hol = train_test_split(df, test_size=0.2, random_state=1)

print(f"training data with {df_trn.shape[0]:,} records and {df_trn.shape[1]} attributes")
print(f"holdout data with {df_hol.shape[0]:,} records and {df_hol.shape[1]} attributes")

In [None]:
from mostlyai.sdk import MostlyAI

# initialize SDK
mostly = MostlyAI()

# train a generator on the original training data
g = mostly.train(data=df_trn, name="TSTR Tutorial Census")

# probe the generator for synthetic data
syn = mostly.probe(g, size=len(df))
print(f"Created synthetic data with {syn.shape[0]:,} records and {syn.shape[1]:,} attributes")

## Explore Synthetic Data

Show 10 randomly sampled synthetic records. Note that you can execute the following cell multiple times to see different samples.

In [None]:
syn.sample(n=10)

Show 5 randomly sampled female records with a professional school education (`Prof-school`), aged 30 or younger.

In [None]:
syn.loc[(syn["sex"] == "Female") & (syn["education"] == "Prof-school") & (syn["age"] <= 30)].sample(n=5)

Count low-income (<=50K) and high-income (>50K) records within the synthetic sample.

In [None]:
syn["income"].value_counts()

Count low-income and high-income records among divorced respondents whose native country is not the United States.

In [None]:
syn.loc[(syn["native_country"] != "United-States") & (syn["marital_status"] == "Divorced")]["income"].value_counts()

## Compare ML Performance

Let's now train a state-of-the-art **LightGBM** classifier on the synthetic data and check how well it can predict whether an actual person reported an annual income of more than $50K. We will then compare its predictive accuracy to that of a model trained on the actual data, and see whether we were able to achieve a similar performance purely based on the synthetic data.

In [None]:
import lightgbm as lgb
from lightgbm import early_stopping
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score
import seaborn as sns
import matplotlib.pyplot as plt

plt.rcParams["figure.dpi"] = 72

target_col = "income"
target_val = ">50K"


def prepare_xy(df):
    # work on a copy, so that the caller's DataFrame is not modified in place
    df = df.copy()
    # binary target: 1 for high income, 0 otherwise
    y = (df[target_col] == target_val).astype(int)
    # convert string columns to pandas categoricals for LightGBM
    str_cols = [col for col in df.select_dtypes(["object", "string"]).columns if col != target_col]
    for col in str_cols:
        df[col] = pd.Categorical(df[col])
    cat_cols = [col for col in df.select_dtypes("category").columns if col != target_col]
    num_cols = [col for col in df.select_dtypes("number").columns if col != target_col]
    for col in num_cols:
        df[col] = df[col].astype("float")
    X = df[cat_cols + num_cols]
    return X, y


def train_model(X, y):
    cat_cols = list(X.select_dtypes("category").columns)
    X_trn, X_val, y_trn, y_val = train_test_split(X, y, test_size=0.2, random_state=1)
    ds_trn = lgb.Dataset(X_trn, label=y_trn, categorical_feature=cat_cols, free_raw_data=False)
    ds_val = lgb.Dataset(X_val, label=y_val, categorical_feature=cat_cols, free_raw_data=False)
    model = lgb.train(
        params={"verbose": -1, "metric": "auc", "objective": "binary"},
        train_set=ds_trn,
        valid_sets=[ds_val],
        callbacks=[early_stopping(5)],
    )
    return model


def evaluate_model(model, hol):
    X_hol, y_hol = prepare_xy(hol)
    probs = model.predict(X_hol)
    preds = (probs >= 0.5).astype(int)
    auc = roc_auc_score(y_hol, probs)
    acc = accuracy_score(y_hol, preds)
    probs_df = pd.concat(
        [
            pd.Series(probs, name="probability").reset_index(drop=True),
            pd.Series(y_hol, name=target_col).reset_index(drop=True),
        ],
        axis=1,
    )
    sns.displot(data=probs_df, x="probability", hue=target_col, bins=20, multiple="stack")
    plt.title(f"Accuracy: {acc:.1%}, AUC: {auc:.1%}", fontsize=20)
    plt.show()
    return auc


import warnings

warnings.filterwarnings("ignore")

### Train a Model on Synthetic Data - Test on Real Data

We train the LightGBM model on synthetic data, and then evaluate its performance on the holdout data. We report two performance metrics:
1. **Accuracy**: This is the probability of correctly predicting the `income` class of a randomly selected record.
2. **AUC** (Area-Under-Curve): This is the probability that the model assigns a higher score to a randomly selected high-income record than to a randomly selected low-income record.

Whereas the Accuracy informs about the overall ability to get the class attribution correct, the AUC specifically informs about the ability to properly rank records with respect to their probability of being within the target class. In both cases, the higher the metric, the better the predictive performance of the model.
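To make the two definitions tangible, here is a minimal sketch that computes both metrics by hand on a toy example. The labels and scores are hypothetical values for illustration, and the helper functions are not part of the tutorial code, which relies on `accuracy_score` and `roc_auc_score` from scikit-learn instead.

```python
def accuracy(y_true, probs, threshold=0.5):
    """Share of records whose thresholded score matches the true class."""
    preds = [int(p >= threshold) for p in probs]
    return sum(int(p == y) for p, y in zip(preds, y_true)) / len(y_true)


def pairwise_auc(y_true, probs):
    """Share of (positive, negative) pairs where the positive record scores higher.

    Ties count as half a win.
    """
    pos = [p for p, y in zip(probs, y_true) if y == 1]
    neg = [p for p, y in zip(probs, y_true) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# hypothetical toy data: 1 = high income, 0 = low income
toy_labels = [0, 0, 1, 1, 0, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8, 0.6, 0.7]

toy_acc = accuracy(toy_labels, toy_scores)
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(f"Accuracy: {toy_acc:.1%}, AUC: {toy_auc:.1%}")
```

In this toy example, 4 of the 6 records are classified correctly at a threshold of 0.5, and 7 of the 9 possible pairs of one high-income and one low-income record are ranked correctly.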

The displayed chart shows the distribution of scores that the model assigned to each of the holdout records. A score close to 0 means that the model is very confident that the record is of low income. A score close to 1 means that the model is very confident that it is a high-income record. These scores are further split by their actual outcome, i.e., whether or not they are actually high income. This makes it possible to visually inspect the model's confidence in assigning the right scores.

In [None]:
# prepare synthetic data, and split into features `X` and target `y`
X_syn, y_syn = prepare_xy(syn)
# train ML model on synthetic data with early stopping to prevent overfitting
model_syn = train_model(X_syn, y_syn)
# evaluate trained model on original holdout data
auc_syn = evaluate_model(model_syn, df_hol)

### Train a Model on Real Data - Test on Real Data

Let's now compare the results achieved with synthetic data to those of a model trained on real data. For a very good synthesizer, we expect the predictive performance of the two models to be close to each other.

In [None]:
# prepare original training data, and split into features `X` and target `y`
X_trn, y_trn = prepare_xy(df_trn)
# train ML model on original training data with early stopping to prevent overfitting
model_trn = train_model(X_trn, y_trn)
# evaluate trained model on original holdout data
auc_trn = evaluate_model(model_trn, df_hol)
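One simple way to put the two results side by side is to report the absolute AUC gap and the ratio of the two AUC values. The following is a minimal sketch; the helper `utility_retained` is a hypothetical name introduced here, and the numbers passed to it are placeholders for illustration, not results of this tutorial. In the notebook, one would pass `auc_syn` and `auc_trn` instead.

```python
def utility_retained(auc_synthetic, auc_real):
    """Return the absolute AUC gap and the ratio of synthetic-trained to real-trained AUC."""
    gap = auc_real - auc_synthetic
    ratio = auc_synthetic / auc_real
    return gap, ratio


# placeholder values for illustration only (NOT measured results)
example_gap, example_ratio = utility_retained(auc_synthetic=0.90, auc_real=0.92)
print(f"AUC gap: {example_gap:.3f}, AUC ratio: {example_ratio:.1%}")
```

Keep in mind that an uninformative model already reaches an AUC of 0.5, so the ratio is only a rough summary; the absolute gap is often easier to interpret.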

## Conclusion

For the given dataset and the given synthesizer, we can observe a near on-par performance of the synthetic data with respect to the given downstream ML task. This means that one can train the model purely on synthetic data and obtain results that are nearly as good as if it were trained on real data, while working only with synthetic records instead of the actual records of the contained individuals.

## Further exercises

In addition to walking through the above instructions, we suggest:
* to run Train-Synthetic-Test-Real 
  * using a different dataset, e.g. the UCI bank-marketing dataset [[4](#refs)]
  * using a different downstream ML model, e.g. a RandomForest model [[5](#refs)]
  * using a different synthesizer, e.g. SynthCity, SDV, etc.
* to check the impact of synthetic upsampling
  * generate 10x or 100x the original data records, and see whether it improves ML accuracy
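
As a possible starting point for the RandomForest exercise, here is a minimal sketch of a scikit-learn pipeline that ordinal-encodes the categorical columns before fitting a `RandomForestClassifier`. The helper `build_rf_pipeline`, the choice of ordinal encoding, and the toy data are assumptions made for illustration; the toy data is randomly generated and is not the census data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder


def build_rf_pipeline(cat_cols, random_state=1):
    """Ordinal-encode the given categorical columns, pass the rest through, fit a forest."""
    encoder = ColumnTransformer(
        transformers=[
            # categories unseen during training are mapped to -1 at prediction time
            ("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), cat_cols),
        ],
        remainder="passthrough",
    )
    forest = RandomForestClassifier(n_estimators=100, random_state=random_state)
    return make_pipeline(encoder, forest)


# hypothetical toy data with one categorical and one numeric feature
rng = np.random.default_rng(1)
n = 400
toy = pd.DataFrame(
    {
        "education": rng.choice(["HS-grad", "Bachelors", "Masters"], size=n),
        "age": rng.integers(18, 70, size=n).astype(float),
    }
)
# toy target that follows a simple deterministic rule
toy_y = ((toy["education"] != "HS-grad") & (toy["age"] > 35)).astype(int)

# first 300 rows for training, remaining 100 rows as holdout
toy_trn, toy_hol = toy.iloc[:300], toy.iloc[300:]
y_trn_toy, y_hol_toy = toy_y.iloc[:300], toy_y.iloc[300:]

rf = build_rf_pipeline(cat_cols=["education"])
rf.fit(toy_trn, y_trn_toy)
rf_auc = roc_auc_score(y_hol_toy, rf.predict_proba(toy_hol)[:, 1])
print(f"toy holdout AUC: {rf_auc:.3f}")
```

To use this within the tutorial, one would fit one pipeline on the synthetic data and one on the original training data, and evaluate both on the same real holdout, just as done above with LightGBM.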

## References<a class="anchor" name="refs"></a>

1. https://arxiv.org/pdf/1706.02633.pdf §3.1.2
1. https://archive.ics.uci.edu/ml/datasets/adult
1. https://www.census.gov/programs-surveys/acs
1. https://archive.ics.uci.edu/ml/datasets/bank+marketing
1. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html