# Explore the Benefits of Rebalancing <a href="https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/rebalancing/rebalancing.ipynb" target="_blank"><img src="https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab" alt="Run on Colab"></a>

In this exercise, we explore the benefits of synthetic rebalancing for heavily imbalanced datasets, where a minority class of interest accounts for only about 0.1% of cases.

Rebalancing is useful when we want to learn more about an otherwise small or underrepresented population segment by seeing more examples of it. Of course, a synthesizer can only leverage the data that it has seen. But if the method is data-efficient, and in particular more effective than the downstream data consumer, then synthetic rebalancing can yield a significant advantage.

<img src='https://raw.githubusercontent.com/mostly-ai/mostly-tutorials/dev/rebalancing/rebalancing.png' width="600px"/>

In terms of evaluation, we again turn to the Train-Synthetic-Test-Real (TSTR) approach to benchmark the predictive accuracy of a model trained on the (rebalanced) synthetic data, and compare it to a model trained on the (imbalanced) actual data. In addition, we benchmark against established rebalancing methods, namely naive upsampling and SMOTE. All four models are then evaluated on holdout data and compared in terms of predictive performance.

## Synthesize Data via MOSTLY AI

For this tutorial, we again use the UCI Adult Income [[1](#refs)] dataset, with the same training and validation split as in the Train-Synthetic-Test-Real tutorial. However, we create an artificial imbalance of 0.1% high-income records in the training data by downsampling the minority class.

The code below will automatically create a rebalanced synthetic dataset using the MOSTLY AI Synthetic Data SDK.

In [None]:
%pip install -U mostlyai  # or: pip install -U 'mostlyai[local]'
%pip install scikit-learn seaborn lightgbm imbalanced-learn

In [None]:
import pandas as pd

# fetch original data
df = pd.read_csv("https://github.com/mostly-ai/public-demo-data/raw/dev/census/census.csv.gz")
df

In [None]:
from sklearn.model_selection import train_test_split

# split into training and validation
df_trn, df_hol = train_test_split(df, test_size=0.2, random_state=1)

print(f"training data with {df_trn.shape[0]:,} records and {df_trn.shape[1]} attributes")
print(f"holdout data with {df_hol.shape[0]:,} records and {df_hol.shape[1]} attributes")

In [None]:
# create an artificial imbalance of 0.1% of high-income records in the training data, by downsampling the minority class


def create_imbalance(df, target, ratio):
    # identify minority and majority class labels (sorted by ascending frequency)
    val_min, val_maj = df[target].value_counts().sort_values().index
    df_maj = df.loc[df[target] == val_maj]
    # number of minority records needed so that they make up `ratio` of the result
    n_min = int(df_maj.shape[0] / (1 - ratio) * ratio)
    df_min = df.loc[df[target] == val_min].sample(n=n_min, random_state=1)
    # combine both classes and shuffle the rows
    df_imb = pd.concat([df_min, df_maj]).sample(frac=1, random_state=1)
    return df_imb


trn = create_imbalance(df_trn, "income", 1 / 1000)
print(f"Created imbalanced training data with {trn.shape[0]:,} records and {trn.shape[1]} attributes")
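
The arithmetic inside `create_imbalance` can be checked in isolation. The following is a minimal sketch; the majority count of 30,000 is a hypothetical round number chosen for illustration, not the actual count in the census training split.

```python
def minority_count(n_maj: int, ratio: float) -> int:
    """Minority rows needed so that minority / (minority + majority) is about `ratio`."""
    return int(n_maj / (1 - ratio) * ratio)


n_maj = 30_000  # hypothetical majority-class count, for illustration only
n_min = minority_count(n_maj, 1 / 1000)
share = n_min / (n_min + n_maj)
print(f"{n_min} minority records out of {n_min + n_maj:,} rows, share {share:.4%}")
```

Because the result is truncated to an integer, the realized share ends up at or slightly below the requested ratio.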

In [None]:
from mostlyai.sdk import MostlyAI

# initialize SDK
mostly = MostlyAI()

In [None]:
# train a generator on the imbalanced training data
g = mostly.train(
    config={
        "name": "Rebalancing Tutorial Census",
        "tables": [
            {
                "name": "data",
                "data": trn,
                "tabular_model_configuration": {
                    "max_training_time": 5,
                },
            }
        ],
    }
)

In [None]:
# generate a synthetic dataset with rebalancing of the income column to 50% ">50K" category
sd = mostly.generate(
    generator=g,
    config={
        "name": "Rebalancing Tutorial Census",
        "tables": [
            {"name": "data", "configuration": {"rebalancing": {"column": "income", "probabilities": {">50K": 0.5}}}}
        ],
    },
)

# start using it
syn = sd.data()
print(f"Created synthetic data with {syn.shape[0]:,} records and {syn.shape[1]:,} attributes")

## Explore Synthetic Data

### Sample Random Records

Let's first show 10 randomly sampled original records from the imbalanced dataset. Try executing the cell multiple times to see different samples. Due to the strong imbalance, you will hardly ever encounter a record of the high-income class (i.e. `income` being `>50K`).

In [None]:
trn.sample(n=10)

Let's now display 10 randomly sampled synthetic records. Again, run the cell multiple times. This time, you should see that the records are evenly distributed across the two `income` classes.

In [None]:
syn.sample(n=10)

### Sample Female Doctors with a High Income

Let's now investigate all female doctors (i.e. records with `education` being `Doctorate`) with a high income. It turns out that there are none in the original data, so we won't be able to learn anything about them.

In [None]:
trn[(trn["income"] == ">50K") & (trn.sex == "Female") & (trn.education == "Doctorate")]

However, the synthetic data does contain realistic, statistically sound records of female doctors with a high income, which allow us to learn about this particular subsegment.

In [None]:
syn[(syn["income"] == ">50K") & (syn.sex == "Female") & (syn.education == "Doctorate")].head()

## Compare ML Performance via TSTR

In [None]:
import lightgbm as lgb
from lightgbm import early_stopping
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score
import seaborn as sns
import matplotlib.pyplot as plt

plt.rcParams["figure.dpi"] = 72

target_col = "income"
target_val = ">50K"


def prepare_xy(df: pd.DataFrame):
    df = df.copy()  # work on a copy to avoid mutating the caller's DataFrame
    y = (df[target_col] == target_val).astype(int)
    str_cols = [col for col in df.select_dtypes(["object", "string"]).columns if col != target_col]
    for col in str_cols:
        df[col] = pd.Categorical(df[col])
    cat_cols = [col for col in df.select_dtypes("category").columns if col != target_col]
    num_cols = [col for col in df.select_dtypes("number").columns if col != target_col]
    for col in num_cols:
        df[col] = df[col].astype("float")
    X = df[cat_cols + num_cols]
    return X, y


def train_model(X, y):
    cat_cols = list(X.select_dtypes("category").columns)
    X_trn, X_val, y_trn, y_val = train_test_split(X, y, test_size=0.2, random_state=1)
    ds_trn = lgb.Dataset(X_trn, label=y_trn, categorical_feature=cat_cols, free_raw_data=False)
    ds_val = lgb.Dataset(X_val, label=y_val, categorical_feature=cat_cols, free_raw_data=False)
    model = lgb.train(
        params={"verbose": -1, "metric": "auc", "objective": "binary"},
        train_set=ds_trn,
        valid_sets=[ds_val],
        callbacks=[early_stopping(5)],
    )
    return model


def evaluate_model(model, hol):
    X_hol, y_hol = prepare_xy(hol)
    probs = model.predict(X_hol)
    auc = roc_auc_score(y_hol, probs)
    f1 = f1_score(y_hol, probs > 0.5, average="macro")
    probs_df = pd.concat(
        [
            pd.Series(probs, name="probability").reset_index(drop=True),
            pd.Series(y_hol, name=target_col).reset_index(drop=True),
        ],
        axis=1,
    )
    sns.displot(data=probs_df, x="probability", hue=target_col, bins=20, multiple="stack")
    plt.title(f"AUC: {auc:.1%}, F1 Score: {f1:.2f}", fontsize=20)
    plt.show()
    return auc


import warnings

warnings.filterwarnings("ignore")

In [None]:
df_hol_min = df_hol.loc[df_hol["income"] == ">50K"]
print(
    f"Holdout data consists of {df_hol.shape[0]:,} records",
    f"with {df_hol_min.shape[0]:,} samples from the minority class",
)

### Train model on the original imbalanced training data

In [None]:
X_trn, y_trn = prepare_xy(trn)
model_trn = train_model(X_trn, y_trn)
auc_trn = evaluate_model(model_trn, df_hol)

With an AUC of about 60%, the model trained on the imbalanced dataset is only slightly better than a coin flip, which corresponds to an AUC of 50%. That is, the downstream LightGBM model is barely able to pick up any signal due to the low number of minority samples.
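
To make the coin-flip baseline concrete, here is a small sketch of the AUC as the share of (positive, negative) pairs that a model ranks correctly, with ties counted as half. The helper `pairwise_auc` and the toy labels and scores are illustrative assumptions; the tutorial itself uses `roc_auc_score` from scikit-learn.

```python
from itertools import product


def pairwise_auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


labels = [0, 0, 0, 1, 1]  # toy labels, for illustration only
auc_constant = pairwise_auc(labels, [0.1] * 5)  # uninformative scores
auc_perfect = pairwise_auc(labels, [0.1, 0.2, 0.3, 0.8, 0.9])  # perfect ranking
auc_partial = pairwise_auc(labels, [0.1, 0.6, 0.3, 0.5, 0.9])  # one mis-ranked pair
print(auc_constant, auc_perfect, round(auc_partial, 3))
```

An uninformative model lands at 0.5 and a perfect ranking at 1.0, so a value near 60% leaves most of the achievable range unused.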

### Train model on naively rebalanced training data

In [None]:
from imblearn.over_sampling import RandomOverSampler

X_trn, y_trn = prepare_xy(trn)
sm = RandomOverSampler(random_state=1)
X_trn_up, y_trn_up = sm.fit_resample(X_trn, y_trn)
model_trn_up = train_model(X_trn_up, y_trn_up)
auc_trn_up = evaluate_model(model_trn_up, df_hol)

Random "naive" upsampling [[2](#refs)], which simply adds minority samples multiple times to achieve a balance, only marginally helps the downstream model in this case.
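
The reason naive upsampling adds so little can be seen in a few lines. The sketch below mirrors the idea of random oversampling using only the standard library; it is not imbalanced-learn's implementation, and the function name and toy records are made up for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=1):
    """Duplicate randomly drawn rows of each smaller class until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    majority_n = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = rng.choices(pool, k=majority_n - n)  # sampling with replacement
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# toy records (age, education), for illustration only
rows = [(25, "HS-grad"), (40, "Bachelors"), (33, "Masters"), (51, "HS-grad"), (47, "Doctorate"), (60, "Masters")]
labels = [0, 0, 0, 0, 1, 1]
up_rows, up_labels = random_oversample(rows, labels)
minority_rows = [r for r, y in zip(up_rows, up_labels) if y == 1]
print(Counter(up_labels), "distinct minority records:", len(set(minority_rows)))
```

The classes end up balanced, but the number of distinct minority records is unchanged, so the downstream model sees no new information about the minority class.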

### Train model on SMOTE rebalanced training data

SMOTE upsampling [[3](#refs)], which creates novel (non-privacy-preserving) samples by interpolating between neighboring samples, boosts the AUC of the downstream model to close to 80%.

In [None]:
from imblearn.over_sampling import SMOTENC

X_trn, y_trn = prepare_xy(trn)
categorical_mask = (X_trn.dtypes == "category").tolist()
categorical_features_indices = [i for i, is_categorical in enumerate(categorical_mask) if is_categorical]
sm = SMOTENC(categorical_features=categorical_features_indices, random_state=1)
X_trn_smote, y_trn_smote = sm.fit_resample(X_trn, y_trn)
model_trn_smote = train_model(X_trn_smote, y_trn_smote)
auc_trn_smote = evaluate_model(model_trn_smote, df_hol)
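
The interpolation step behind SMOTE can be sketched for purely numeric features as follows. This is a simplified illustration under stated assumptions: the function name, the choice of `k`, and the toy points are hypothetical, and the handling of categorical columns that `SMOTENC` provides is not covered.

```python
import random


def smote_like_sample(minority, k=2, seed=1):
    """Create one synthetic point on the segment between a minority point and a near neighbour."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    # k nearest neighbours of the base point by squared Euclidean distance
    neighbours = sorted(others, key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))[:k]
    neighbour = rng.choice(neighbours)
    lam = rng.random()  # interpolation weight in [0, 1)
    return tuple(a + lam * (b - a) for a, b in zip(base, neighbour))


# toy minority points (age, hours per week), for illustration only
minority = [(30.0, 40.0), (35.0, 45.0), (50.0, 60.0)]
new_point = smote_like_sample(minority)
print(new_point)
```

Every synthetic point lies on a straight line between two existing minority records, which is why such samples stay close to the original data.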

### Train model on balanced synthetic data

In [None]:
X_syn, y_syn = prepare_xy(syn)
model_syn = train_model(X_syn, y_syn)
auc_syn = evaluate_model(model_syn, df_hol)

Both performance measures, the AUC [[4](#refs)] as well as the macro-averaged F1 score [[5](#refs)], are significantly better for the model trained on the synthetic data than for the models trained with any of the other methods. This is strong evidence for the value of synthetic rebalancing when learning more about a small subgroup within the population.
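
For readers less familiar with the macro-averaged F1 score, the following sketch computes it by hand for a 0.5 probability threshold, as used in `evaluate_model`. The helper functions and the toy probabilities are illustrative assumptions; the tutorial relies on `f1_score(..., average="macro")` from scikit-learn.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score when treating `cls` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# toy labels and predicted probabilities, for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
probs = [0.1, 0.2, 0.4, 0.6, 0.7, 0.3, 0.8, 0.9]
y_pred = [int(p > 0.5) for p in probs]
f1_model = macro_f1(y_true, y_pred)
f1_all_majority = macro_f1(y_true, [0] * len(y_true))  # always predicts class 0
print(round(f1_model, 3), round(f1_all_majority, 3))
```

Because each class contributes equally, a model that never predicts the minority class is penalized heavily, which makes the macro-averaged F1 score a useful complement to the AUC here.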

## Conclusion

For the given dataset and the given synthesizer, we can see that both data analysts and AI engineers can learn more from a balanced synthetic dataset than from the imbalanced original dataset. Note that the actual lift in performance may vary, depending on the dataset, the predictive task, and the chosen ML model.

## Further exercises

In addition to walking through the above instructions, we suggest:
* to repeat the experiments for different class imbalances - see the helper script at the bottom to create such experiments
* to repeat the experiments for different datasets, ML models, etc.
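
One possible way to set up experiments with different class imbalances is to loop over several ratios. The sketch below reuses the logic of `create_imbalance` from this tutorial on a small toy DataFrame; the toy data, the `seed` parameter, and the list of ratios are assumptions made for illustration. In the notebook, `df_trn` would take the place of `toy`, followed by the training and evaluation steps for each ratio.

```python
import pandas as pd


def create_imbalance(df, target, ratio, seed=1):
    # same logic as the tutorial's helper, with the seed exposed as a parameter
    val_min, val_maj = df[target].value_counts().sort_values().index
    df_maj = df.loc[df[target] == val_maj]
    n_min = int(df_maj.shape[0] / (1 - ratio) * ratio)
    df_min = df.loc[df[target] == val_min].sample(n=n_min, random_state=seed)
    return pd.concat([df_min, df_maj]).sample(frac=1, random_state=seed)


# toy stand-in for the training data, for illustration only
toy = pd.DataFrame({"age": range(4000), "income": ["<=50K"] * 3000 + [">50K"] * 1000})

shares = {}
for ratio in [0.001, 0.01, 0.1]:
    imb = create_imbalance(toy, "income", ratio)
    shares[ratio] = (imb["income"] == ">50K").mean()
    print(f"ratio {ratio}: {imb.shape[0]:,} records, minority share {shares[ratio]:.4f}")
```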

## References<a class="anchor" name="refs"></a>

1. https://archive.ics.uci.edu/ml/datasets/adult
1. https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.RandomOverSampler.html
1. https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTENC.html
1. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
1. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html