# Size vs. Accuracy Trade-Off <a href="https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/size-vs-accuracy/size-vs-accuracy.ipynb" target="_blank"><img src="https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab" alt="Run on Colab"></a>

In this exercise, we explore the relationship between the number of training samples used for the synthesis and the accuracy of the generated synthetic data. We expect accuracy to increase with the number of training samples. A larger number of training samples, however, also increases the computational effort, i.e. the overall runtime.

<img src='https://raw.githubusercontent.com/mostly-ai/mostly-tutorials/dev/size-vs-accuracy/size-vs-accuracy.png' width="600px"/>

Note that we should not expect synthetic data to match the original data perfectly. Only an exact copy of the data would achieve that, and a copy would neither satisfy any privacy requirements nor provide any novel samples. Instead, we should expect the synthetic data to deviate from the original data due to sampling variance. Ideally, it deviates just as much as, and no more than, an actual holdout dataset would.
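
To get a feel for the size of such sampling variance, the following minimal sketch draws two independent samples from the same distribution and measures how far apart their means are. It uses simulated Gaussian values, where the mean of 40 and the standard deviation of 12 are arbitrary assumptions for illustration and are not taken from the census data. The helper name `mean_gap` is hypothetical.

```python
import random
import statistics


def mean_gap(n: int, repeats: int = 50, seed: int = 1) -> float:
    """Average absolute difference between the means of two independent samples of size n."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(repeats):
        # two samples from the SAME distribution, comparable to training vs. holdout
        a = [rng.gauss(40, 12) for _ in range(n)]
        b = [rng.gauss(40, 12) for _ in range(n)]
        gaps.append(abs(statistics.fmean(a) - statistics.fmean(b)))
    return statistics.fmean(gaps)


for n in [200, 1600, 6400]:
    print(f"n={n:>5,}: average gap between sample means = {mean_gap(n):.3f}")
```

The gap shrinks roughly with the square root of the sample size, so even two samples of real data never agree exactly. This is the amount of deviation that synthetic data can reasonably be allowed as well.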

## Synthesize Data via MOSTLY AI

For this tutorial, we use the same UCI Adult Income [[1](#refs)] dataset that was used in the Train-Synthetic-Test-Real tutorial. It has 48,842 records across 15 attributes in total, and we use up to 39,073 (=80%) of those records to create Generators.

The following code creates several Generators, each with a different number of training samples: 200, 400, 1,600, 6,400 and 25,600. Feel free to adjust these numbers as you experiment. Subsequently, Synthetic Datasets are created from these Generators.

In [None]:
%pip install -U mostlyai  # or: pip install -U 'mostlyai[local]'
%pip install scikit-learn seaborn lightgbm

In [None]:
import pandas as pd

# fetch original data
df = pd.read_csv("https://github.com/mostly-ai/public-demo-data/raw/dev/census/census.csv.gz")
df

In [None]:
from sklearn.model_selection import train_test_split

# split into training and holdout
df_trn, df_hol = train_test_split(df, test_size=0.2, random_state=1)

print(f"training data with {df_trn.shape[0]:,} records and {df_trn.shape[1]} attributes")
print(f"holdout data with {df_hol.shape[0]:,} records and {df_hol.shape[1]} attributes")

In [None]:
from mostlyai.sdk import MostlyAI

# initialize SDK
mostly = MostlyAI()

# create Generators with different sample sizes
g_200 = mostly.train(data=df_trn.sample(200), name="census_200")
g_400 = mostly.train(data=df_trn.sample(400), name="census_400")
g_1600 = mostly.train(data=df_trn.sample(1600), name="census_1600")
g_6400 = mostly.train(data=df_trn.sample(6400), name="census_6400")
g_25600 = mostly.train(data=df_trn.sample(25600), name="census_25600")
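
The calls above draw each subset independently and without a fixed seed, so the subsets change from run to run. As an optional variant (not what this tutorial does), the following sketch shuffles the data once with a fixed seed and takes the first `n` rows for each size, so that results are reproducible and every smaller subset is contained in the larger ones. The helper name `nested_samples` and the toy frame are assumptions for illustration.

```python
import pandas as pd


def nested_samples(df: pd.DataFrame, sizes: list[int], seed: int = 1) -> dict[int, pd.DataFrame]:
    """Shuffle once with a fixed seed, then take the first n rows for each requested size."""
    shuffled = df.sample(frac=1.0, random_state=seed)
    return {n: shuffled.head(n) for n in sizes}


# toy frame for illustration; in the notebook this would be `df_trn`
toy = pd.DataFrame({"x": range(100)})
subsets = nested_samples(toy, sizes=[10, 40])
print({n: len(s) for n, s in subsets.items()})
```

Nested subsets remove one source of noise when comparing Generators, because a larger subset only ever adds records to a smaller one.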

In [None]:
# Generate synthetic data
synthetic_data = {
    "syn_200": mostly.probe(g_200, size=10_000),
    "syn_400": mostly.probe(g_400, size=10_000),
    "syn_1600": mostly.probe(g_1600, size=10_000),
    "syn_6400": mostly.probe(g_6400, size=10_000),
    "syn_25600": mostly.probe(g_25600, size=10_000),
}

Now go to the MOSTLY AI UI, look at the created Generators, note the reported runtime of each training step, and update the following DataFrame accordingly. The overall accuracy of each Generator is loaded automatically.

In [None]:
results = pd.DataFrame(
    [
        {"samples": 200, "accuracy": g_200.accuracy, "trainingtime": 2},
        {"samples": 400, "accuracy": g_400.accuracy, "trainingtime": 5},
        {"samples": 1600, "accuracy": g_1600.accuracy, "trainingtime": 24},
        {"samples": 6400, "accuracy": g_6400.accuracy, "trainingtime": 56},
        {"samples": 25600, "accuracy": g_25600.accuracy, "trainingtime": 73},
    ]
)
results
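
As an alternative to reading the runtimes off the UI, the elapsed wall-clock time of a call can be measured directly in the notebook. The following is a minimal sketch with a hypothetical helper `timed`. It assumes that the wrapped call blocks until the work is finished. The measured time covers everything the call waits for, so it may differ from the training time reported in the UI.

```python
import time
from typing import Any, Callable


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> tuple[Any, float]:
    """Call fn(*args, **kwargs) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# stand-in workload; in the notebook this could wrap a call such as
# timed(mostly.train, data=df_trn.sample(200), name="census_200")
result, seconds = timed(sum, range(1_000_000))
print(f"result={result}, elapsed={seconds:.3f}s")
```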

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.catplot(data=results, y="accuracy", x="samples", kind="point", color="black")
plt.xticks(rotation=45)
plt.xlabel("")
plt.title("QA Report - Overall Accuracy")
plt.show()

## Explore Synthetic Data

Show 3 randomly sampled synthetic records for each of the datasets. Note that you can execute the following cell multiple times to see different samples.

In [None]:
for generator, df in synthetic_data.items():
    print("===", generator, "===")
    display(df.sample(n=3))

## Quality Assessment

Concatenate all datasets to ease comparisons across them.

In [None]:
# combine synthetics
df = pd.concat([d.assign(split=k) for k, d in synthetic_data.items()], axis=0)
df["split"] = pd.Categorical(df["split"], categories=df["split"].unique())
df.insert(0, "split", df.pop("split"))
# combine synthetics and original
dataset = synthetic_data | {"training": df_trn, "holdout": df_hol}
df_all = pd.concat([d.assign(split=k) for k, d in dataset.items()], axis=0)
df_all["split"] = pd.Categorical(df_all["split"], categories=df_all["split"].unique())
df_all.insert(0, "split", df_all.pop("split"))

### Compare single statistic

The more training samples have been used for the synthesis, the closer the synthetic distributions are expected to be to the original ones.

Note that we also see deviations in the statistics between the training and the holdout data. This is expected due to sampling variance. The smaller the dataset, the larger the sampling variance will be.

#### Average number of Hours-Per-Week, split by Gender

In [None]:
stats = (
    df_all.groupby(["split", "sex"], observed=True)["hours_per_week"].mean().round(1).to_frame().reset_index(drop=False)
)
stats = stats.pivot_table(index="split", columns=["sex"], observed=True).reset_index(drop=False)
stats

#### Average Age, split by Marital Status

In [None]:
stats = (
    df_all.groupby(["split", "marital_status"], observed=True)["age"].mean().round().to_frame().reset_index(drop=False)
)
stats = stats.loc[~stats["marital_status"].isin(["_RARE_", "Married-AF-spouse", "Married-spouse-absent", "Separated"])]
stats = stats.pivot_table(index="split", columns="marital_status", values="age", observed=True).reset_index()
stats

#### Age distribution, split by Income

In [None]:
sns.catplot(data=df_all, x="age", y="split", hue="income", kind="violin", split=True)
plt.show()

### Check rule adherence

The original data has a 1:1 relationship between `education` and `education_num`. Let's check in how many cases the generated synthetic data has correctly retained that specific rule between these two columns.

In [None]:
# display unique combinations of `education` and `education_num`
df_trn[["education", "education_num"]].drop_duplicates().sort_values("education_num").reset_index(drop=True)

In [None]:
# Convert `education` to Categorical with proper sort order
df["education"] = pd.Categorical(df["education"], categories=df_trn.sort_values("education_num")["education"].unique())

# Calculate, per split, the share of rows where `education` matches `education_num`
stats = (
    df.groupby("split", observed=True)
    .apply(lambda x: ((x["education"].cat.codes + 1) == x["education_num"]).mean())
    .reset_index(name="matches")
)

stats

In [None]:
sns.catplot(data=stats, y="matches", x="split", kind="point", color="black")
plt.xticks(rotation=45)
plt.xlabel("")
plt.title("Share of Matches")
plt.show()
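
The check above relies on the category codes of `education` lining up with `education_num`. An alternative that makes the rule explicit is to derive the mapping from reference data and look up the expected value for every row. The following sketch shows that idea on toy data. The helper name `share_of_matches` and the toy frames are assumptions for illustration, not part of the tutorial.

```python
import pandas as pd


def share_of_matches(reference: pd.DataFrame, candidate: pd.DataFrame) -> float:
    """Share of candidate rows whose (education, education_num) pair follows the reference mapping."""
    # one row per education level, as observed in the reference data
    mapping = reference.drop_duplicates("education").set_index("education")["education_num"]
    # expected value per row; NaN for levels that do not occur in the reference
    expected = candidate["education"].map(mapping)
    return float((expected == candidate["education_num"]).mean())


# toy data for illustration only, not actual census records
toy_reference = pd.DataFrame({"education": ["Bachelors", "HS-grad", "Masters"], "education_num": [13, 9, 14]})
toy_candidate = pd.DataFrame(
    {"education": ["Bachelors", "HS-grad", "Masters", "Bachelors"], "education_num": [13, 9, 13, 13]}
)
print(share_of_matches(toy_reference, toy_candidate))
```

In the notebook, the reference would be `df_trn` and the candidate one of the synthetic datasets. Rows with an education level that is absent from the reference count as mismatches.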

### Compare ML performance via TSTR

Let's perform a Train-Synthetic-Test-Real evaluation via a downstream LightGBM classifier.

In [None]:
import lightgbm as lgb
from lightgbm import early_stopping
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

target_col = "income"
target_val = ">50K"


# prepare data, and split into features `X` and target `y`
def prepare_xy(df: pd.DataFrame):
    y = (df[target_col] == target_val).astype(int)
    str_cols = [col for col in df.select_dtypes(["object", "string"]).columns if col != target_col]
    for col in str_cols:
        df[col] = pd.Categorical(df[col])
    cat_cols = [col for col in df.select_dtypes("category").columns if col != target_col]
    num_cols = [col for col in df.select_dtypes("number").columns if col != target_col]
    for col in num_cols:
        df[col] = df[col].astype("float")
    X = df[cat_cols + num_cols]
    return X, y


# train ML model with early stopping
def train_model(X, y):
    cat_cols = list(X.select_dtypes("category").columns)
    X_trn, X_val, y_trn, y_val = train_test_split(X, y, test_size=0.2, random_state=1)
    ds_trn = lgb.Dataset(X_trn, label=y_trn, categorical_feature=cat_cols, free_raw_data=False)
    ds_val = lgb.Dataset(X_val, label=y_val, categorical_feature=cat_cols, free_raw_data=False)
    model = lgb.train(
        params={"verbose": -1, "metric": "auc", "objective": "binary"},
        train_set=ds_trn,
        valid_sets=[ds_val],
        callbacks=[early_stopping(5)],
    )
    return model


# apply ML model to the holdout data and return the AUC
def evaluate_model(model, hol):
    X_hol, y_hol = prepare_xy(hol)
    probs = model.predict(X_hol)
    auc = roc_auc_score(y_hol, probs)
    return auc


def train_and_evaluate(df):
    X, y = prepare_xy(df)
    model = train_model(X, y)
    auc = evaluate_model(model, df_hol)
    return auc


import warnings

warnings.filterwarnings("ignore")

In [None]:
aucs = {k: train_and_evaluate(df) for k, df in synthetic_data.items()}
aucs = pd.Series(aucs).round(3).to_frame("auc").reset_index()

In [None]:
sns.catplot(data=aucs, y="auc", x="index", kind="point", color="black")
plt.xticks(rotation=45)
plt.xlabel("")
plt.title("Predictive Performance (AUC) on Holdout")
plt.show()
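
To make the reported metric more tangible: the AUC can be read as the probability that a randomly chosen positive record receives a higher score than a randomly chosen negative one. The following sketch computes it by comparing all such pairs. The function name `pairwise_auc` is hypothetical, the labels and scores are made up, and the quadratic loop is meant for illustration only. The tutorial itself uses `roc_auc_score` from scikit-learn.

```python
def pairwise_auc(labels: list[int], scores: list[float]) -> float:
    """AUC as the share of (positive, negative) pairs that are ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# made-up labels and scores for illustration
print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so small differences between the synthetic datasets are easier to judge against those two reference points.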

## Conclusion

For the given dataset and the given synthesizer, we can indeed observe an increase in synthetic data quality with a growing number of training samples. This can be measured with respect to accuracy as well as ML utility.

We can also observe that a holdout dataset exhibits deviations from the training data due to sampling noise as well. Since the holdout data is actual data that hasn't been seen before, it serves as a north star for the maximum achievable accuracy of synthetic data. See our paper on this subject [[2](#refs)].

## Further exercises

In addition to walking through the above instructions, we suggest:
* to limit model training to a few epochs, e.g. by setting the maximum number of epochs to 1 or 5, and study the impact on runtime and quality.
* to synthesize with different model sizes (Small, Medium and Large), and study the impact on runtime and quality.
* to synthesize with the same settings several times, and with that study the variability in quality across several runs.
* to calculate your own statistics, and then compare the deviations between synthetic and training data. The deviations between holdout and training data can serve as a benchmark.
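
For the last exercise, a comparison against the holdout benchmark could be sketched as follows. The helper name `deviations_vs_training`, the split names and all numbers are placeholders for illustration only and are not results from this tutorial.

```python
def deviations_vs_training(stats: dict[str, float], reference: str = "training") -> dict[str, float]:
    """Absolute deviation of each split's statistic from the reference split."""
    ref = stats[reference]
    return {split: abs(value - ref) for split, value in stats.items() if split != reference}


# placeholder numbers for illustration only, NOT results from this tutorial
example_stats = {"training": 40.0, "holdout": 40.5, "syn_small": 43.0, "syn_large": 40.25}
devs = deviations_vs_training(example_stats)
benchmark = devs["holdout"]  # deviation that real, unseen data shows
for split, dev in devs.items():
    verdict = "within" if dev <= benchmark else "beyond"
    print(f"{split}: deviation={dev:.2f} ({verdict} the holdout benchmark)")
```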

## References<a class="anchor" name="refs"></a>

1. https://archive.ics.uci.edu/ml/datasets/adult
1. https://www.frontiersin.org/articles/10.3389/fdata.2021.679939/full