# Develop a Fake or Real Discriminator <a href="https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/fake-or-real/fake-or-real.ipynb" target="_blank"><img src="https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab" alt="Run on Colab"></a>

In this notebook, we walk through the steps of developing a machine learning model that is trained to distinguish between fake (i.e., synthetic) and real records. The model's ability to tell the two apart on an unseen holdout set serves as an additional quality criterion for the generated synthetic data: the more realistic the synthetic records are, the harder it is for any discriminator to separate them from the real records.

<img src='https://raw.githubusercontent.com/mostly-ai/mostly-tutorials/dev/fake-or-real/fake-or-real.png' width="400px"/>

To make the analysis more interesting, we intentionally create synthetic data of lower quality by limiting the training sample to only 5,000 records. Otherwise, with a high-quality synthesizer such as the MOSTLY AI Synthetic Data SDK, the discriminator would find very little signal to pick up on.

## Synthesize Data via MOSTLY AI

For this tutorial, we again use the UCI Adult Income [[1](#refs)] dataset, which consists of 48,842 records across 15 attributes.

We will use the Synthetic Data SDK to create a Generator and then use that Generator to create a Synthetic dataset.

In [None]:
%pip install -U mostlyai  # or: pip install -U 'mostlyai[local]'
%pip install scikit-learn seaborn lightgbm

In [None]:
import pandas as pd

# fetch original data
df_tgt = pd.read_csv("https://github.com/mostly-ai/public-demo-data/raw/dev/census/census.csv.gz")
df_tgt

In [None]:
from mostlyai.sdk import MostlyAI

# initialize SDK
mostly = MostlyAI()

In [None]:
# create generator with a sample of 5,000
g = mostly.train(
    config={
        "name": "Fake vs. Real Tutorial - Census Data",
        "tables": [
            {
                "name": "data",
                "data": df_tgt,
                "tabular_model_configuration": {
                    "max_sample_size": 5_000,
                    "max_training_time": 2,
                },
            }
        ],
    }
)

In [None]:
# Generate synthetic data
syn = mostly.probe(g, size=5_000)
print(f"Created synthetic data with {syn.shape[0]:,} records and {syn.shape[1]:,} attributes")

## Train Discriminator

In [None]:
import lightgbm as lgb
from lightgbm import early_stopping
from sklearn.model_selection import train_test_split


def prepare_xy(df, target_col, target_val):
    # work on a copy so that the caller's DataFrame is not modified in place
    df = df.copy()
    # binary target `y`: 1 where `target_col` equals `target_val`, else 0
    y = (df[target_col] == target_val).astype(int)
    # convert strings to categoricals, and all others to floats
    str_cols = [col for col in df.select_dtypes(["object", "string"]).columns if col != target_col]
    for col in str_cols:
        df[col] = pd.Categorical(df[col])
    cat_cols = [col for col in df.select_dtypes("category").columns if col != target_col]
    num_cols = [col for col in df.select_dtypes("number").columns if col != target_col]
    for col in num_cols:
        df[col] = df[col].astype("float")
    # feature matrix `X`: categorical columns first, then numeric columns
    X = df[cat_cols + num_cols]
    return X, y


def train_model(X, y):
    cat_cols = list(X.select_dtypes("category").columns)
    X_trn, X_val, y_trn, y_val = train_test_split(X, y, test_size=0.2, random_state=1)
    ds_trn = lgb.Dataset(X_trn, label=y_trn, categorical_feature=cat_cols, free_raw_data=False)
    ds_val = lgb.Dataset(X_val, label=y_val, categorical_feature=cat_cols, free_raw_data=False)
    model = lgb.train(
        params={"verbose": -1, "metric": "auc", "objective": "binary"},
        train_set=ds_trn,
        valid_sets=[ds_val],
        callbacks=[early_stopping(5)],
    )
    return model


import warnings

warnings.filterwarnings("ignore")

In [None]:
# concatenate FAKE and REAL data together
df = pd.concat(
    [
        df_tgt.assign(split="REAL"),
        syn.assign(split="FAKE"),
    ],
    axis=0,
)
df.insert(0, "split", df.pop("split"))
df.groupby("split").size()

In [None]:
# take a 20% holdout dataset aside for evaluation
trn, hol = train_test_split(df, test_size=0.2, random_state=1)

In [None]:
# train the discriminator on the remaining 80% training dataset
X_trn, y_trn = prepare_xy(trn, "split", "FAKE")
model = train_model(X_trn, y_trn)

In [None]:
# score the model on the holdout dataset, assigning a probability to each record on whether it's FAKE or REAL
X_hol, y_hol = prepare_xy(hol, "split", "FAKE")
hol.insert(1, "is_fake", model.predict(X_hol).round(4))

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, accuracy_score

auc = roc_auc_score(y_hol, hol.is_fake)
acc = accuracy_score(y_hol, (hol.is_fake > 0.5).astype(int))
probs_df = pd.concat(
    [
        pd.Series(hol.is_fake, name="probability").reset_index(drop=True),
        pd.Series(y_hol, name="target").reset_index(drop=True),
    ],
    axis=1,
)
grid = sns.displot(data=probs_df, x="probability", hue="target", bins=20, multiple="stack")
plt.title(f"Accuracy: {acc:.1%}, AUC: {auc:.1%}")
plt.show()

As the chart above shows, the discriminator has learned to pick up some signals that allow it to determine, with varying levels of confidence, whether a record is FAKE or REAL.
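
When reading the accuracy figure, it helps to compare it against what a trivial model would achieve. The sketch below assumes the row counts stated in this notebook (48,842 real records and 5,000 synthetic records) and computes the accuracy of always predicting the larger class. The helper name `majority_baseline` is hypothetical and only serves this illustration.

```python
def majority_baseline(n_real, n_fake):
    """Accuracy achieved by always predicting the larger class."""
    return max(n_real, n_fake) / (n_real + n_fake)


# row counts as stated in this notebook (assumption); the holdout is a random
# split, so its class shares will only approximately match these
baseline = majority_baseline(n_real=48_842, n_fake=5_000)
print(f"Majority-class baseline accuracy: {baseline:.1%}")
```

An accuracy close to this baseline says little about the discriminator, which is one reason to also look at the AUC, a metric that does not depend on the class balance in the same way.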

The AUC can be interpreted as the share of cases in which the discriminator correctly spots the FAKE record when presented with a pair consisting of one FAKE and one REAL record, i.e., in which it assigns the higher `is_fake` probability to the FAKE record.
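
This pairwise reading of the AUC can be made concrete with a small sketch. It uses made-up labels and scores rather than the notebook's holdout data, and the helper `pairwise_auc` is a hypothetical name introduced here for illustration. It simply counts, over all (FAKE, REAL) pairs, how often the FAKE record receives the higher score.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (FAKE, REAL) pairs in which the FAKE record scores higher.

    Ties count as half a win, matching the usual ROC AUC convention.
    """
    fake = [s for y, s in zip(labels, scores) if y == 1]
    real = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((f > r) + 0.5 * (f == r) for f in fake for r in real)
    return wins / (len(fake) * len(real))


# toy example: 1 = FAKE, 0 = REAL
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.35, 0.1]
auc_toy = pairwise_auc(labels, scores)  # 7 of the 9 pairs are ranked correctly
print(f"Pairwise AUC: {auc_toy:.3f}")
```

For larger datasets this quadratic loop becomes slow, which is why `roc_auc_score` is used in the notebook itself. On the same inputs both should agree.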

#### Sample records that seem very FAKE

In [None]:
hol.sort_values("is_fake").tail(n=100).sample(n=5)

In these cases, it is the mismatch between `education` and `education_num` that gives away the fact that these records are FAKE. For example, in the original data the education level `Assoc-acdm` is mapped to education number 12, whereas in the synthetic data we see various other numeric values for it.
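
One way to quantify such a mismatch is to measure how many synthetic rows contain a combination of the two columns that never occurs in the real data. The following is a minimal sketch on toy data, not part of the original notebook. The helper `mapping_violation_rate` and the toy frames are hypothetical, and in the notebook one would pass `df_tgt` and `syn` instead.

```python
import pandas as pd


def mapping_violation_rate(real, syn, key="education", value="education_num"):
    """Share of synthetic rows whose (key, value) pair never occurs in the real data."""
    real_pairs = real[[key, value]].drop_duplicates()
    # left-join synthetic rows onto the real pairs; unmatched rows are violations
    merged = syn[[key, value]].merge(real_pairs, on=[key, value], how="left", indicator=True)
    return (merged["_merge"] == "left_only").mean()


# toy stand-ins for `df_tgt` and `syn` (made-up values)
real_toy = pd.DataFrame(
    {"education": ["Assoc-acdm", "Bachelors", "Assoc-acdm"], "education_num": [12, 13, 12]}
)
syn_toy = pd.DataFrame(
    {
        "education": ["Assoc-acdm", "Assoc-acdm", "Bachelors", "Bachelors"],
        "education_num": [12, 10, 13, 13],
    }
)
rate = mapping_violation_rate(real_toy, syn_toy)
print(f"{rate:.0%} of synthetic rows break the real mapping")
```

In the toy data, only the `Assoc-acdm` row with education number 10 has no counterpart in the real pairs, so one of the four synthetic rows is flagged.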

In [None]:
pd.crosstab(df_tgt.education, df_tgt.education_num)

In [None]:
pd.crosstab(syn.education, syn.education_num)

#### Sample records that seem very REAL

These are the types of records that the synthesizer has apparently failed to create. Because they are absent from the synthetic data, the discriminator recognizes them as REAL.

In [None]:
hol.sort_values("is_fake").head(n=100).sample(n=5)

## Conclusion

This tutorial has shown how to train a discriminator that sets out to distinguish between FAKE and REAL records. The better the quality of the generated synthetic data, the less likely it is that the discriminator (or we humans) can tell them apart.

## Further exercises

In addition to walking through the above instructions, we suggest:
* measuring the discriminator's AUC if more training samples are used
* using a different dataset, e.g., the UCI bank-marketing dataset [[2](#refs)]
* using a different ML model for the discriminator, e.g., a RandomForest model [[3](#refs)]
* using a different synthesizer, e.g., SynthCity, SDV, etc.
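
As a possible starting point for the RandomForest exercise, here is a minimal sketch using scikit-learn. It is not the notebook's method: the helper `train_rf_discriminator` is a hypothetical name, the toy data is made up, and ordinal encoding of string columns is just one simple choice, since RandomForest needs numeric inputs. The sketch also assumes there are no missing values.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder


def train_rf_discriminator(df, target_col="split", target_val="FAKE", seed=1):
    """Fit a RandomForest discriminator and return (model, holdout AUC)."""
    y = (df[target_col] == target_val).astype(int)
    X = df.drop(columns=[target_col])
    cat_cols = X.select_dtypes(exclude="number").columns.tolist()
    # ordinal-encode non-numeric columns; categories unseen during fit become -1
    encoder = ColumnTransformer(
        [("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), cat_cols)],
        remainder="passthrough",
    )
    model = make_pipeline(encoder, RandomForestClassifier(n_estimators=100, random_state=seed))
    X_trn, X_hol, y_trn, y_hol = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    model.fit(X_trn, y_trn)
    auc = roc_auc_score(y_hol, model.predict_proba(X_hol)[:, 1])
    return model, auc


# toy data: REAL rows keep a fixed education -> education_num mapping,
# FAKE rows draw education_num at random (made-up stand-in for `df`)
rng = np.random.default_rng(0)
levels = {"HS-grad": 9, "Bachelors": 13, "Masters": 14}
edu = rng.choice(list(levels), size=1000)
real = pd.DataFrame({"education": edu[:500], "education_num": [levels[e] for e in edu[:500]]})
fake = pd.DataFrame(
    {"education": edu[500:], "education_num": rng.choice(list(levels.values()), size=500)}
)
toy = pd.concat([real.assign(split="REAL"), fake.assign(split="FAKE")], ignore_index=True)

rf_model, rf_auc = train_rf_discriminator(toy)
print(f"RandomForest holdout AUC: {rf_auc:.1%}")
```

Applied to the notebook's concatenated `df`, the same helper would give an AUC that can be compared with the LightGBM discriminator's result.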

## References<a class="anchor" name="refs"></a>

1. https://archive.ics.uci.edu/ml/datasets/adult
2. https://archive.ics.uci.edu/ml/datasets/bank+marketing
3. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html