# Create Fair Synthetic Data

Fairness in machine learning aims to ensure that algorithms and models treat individuals and groups equitably, without introducing or perpetuating bias. The objective is to prevent discrimination and address societal inequities, particularly concerning protected attributes such as race, gender, age, or ethnicity.

In this tutorial, we showcase how MOSTLY AI’s Fairness feature can help bridge fairness gaps in your data. Downstream models trained on a fair synthetic dataset are better positioned to produce fair and unbiased predictions.

For further background see also [this paper](https://arxiv.org/abs/2311.03000) on "_Strong statistical parity through fair synthetic data_".

## Data Preparation

Let's use the UCI Adult [1] dataset, consisting of 48,842 records across 14 attributes. In this dataset, ~30% of men have a high income compared to only ~11% of women, resulting in a statistical parity difference of roughly 0.2 (0.30 − 0.11 ≈ 0.19).
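To make the metric concrete: the statistical parity difference is the gap between two groups' rates of the favorable outcome. The following is a minimal, dependency-free sketch. The helper name `statistical_parity_difference` and the toy records are illustrative assumptions; they are not part of the MOSTLY AI SDK or of the dataset.

```python
from collections import defaultdict


def statistical_parity_difference(groups, outcomes, privileged, unprivileged):
    """Return P(outcome | privileged) - P(outcome | unprivileged)."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, outcome in zip(groups, outcomes):
        totals[group] += 1
        positives[group] += int(bool(outcome))
    # share of favorable outcomes within each group
    rate = {g: positives[g] / totals[g] for g in totals}
    return rate[privileged] - rate[unprivileged]


# toy data (made up for illustration): 3 of 10 men and 1 of 10 women have a high income
sex = ["Male"] * 10 + ["Female"] * 10
high_income = [True] * 3 + [False] * 7 + [True] * 1 + [False] * 9

spd = statistical_parity_difference(sex, high_income, privileged="Male", unprivileged="Female")
print(round(spd, 2))
```

On the tutorial's DataFrame, the same helper could be called as `statistical_parity_difference(df["sex"], df["income"] == ">50K", "Male", "Female")`, assuming the `sex` column uses the labels `Male` and `Female`.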

In [None]:
%pip install -U mostlyai # or: pip install -U 'mostlyai[local]'
%pip install matplotlib plotly scikit-learn lightgbm

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# fetch original data
df = pd.read_csv("https://github.com/mostly-ai/public-demo-data/raw/dev/census/census.csv.gz")
df

In [None]:
# split 80/20 into training and holdout
trn, hol = train_test_split(df, test_size=0.2, random_state=42)
trn.reset_index(drop=True, inplace=True)
hol.reset_index(drop=True, inplace=True)

In [None]:
import matplotlib.pyplot as plt


def plot_income_by_gender(df, title):
    # Create a bar plot for the distribution of income for males and females
    plt.figure(figsize=(10, 6))
    income_gender_distribution = df.groupby(["sex", "income"]).size().unstack()
    income_gender_proportions = (
        income_gender_distribution.div(income_gender_distribution.sum(axis=1), axis=0) * 100
    )  # Convert to percentages

    # Customizing the plot
    ax = income_gender_proportions.plot(kind="bar", stacked=True, ax=plt.gca())
    ax.spines["top"].set_visible(False)
    ax.spines["right"].set_visible(False)

    # Adding title and labels
    plt.title(f"Distribution of Income by Gender - {title}", fontsize=16, weight="bold")
    plt.xlabel("Gender", fontsize=14)
    plt.ylabel("Share (%)", fontsize=14)
    plt.xticks(rotation=0, fontsize=12)
    plt.yticks(fontsize=12)
    plt.legend(title="Income", fontsize=12, title_fontsize=14, loc="upper right")

    # Adding data labels
    for bar_group in ax.containers:
        ax.bar_label(bar_group, fmt="%.1f%%", label_type="center", fontsize=10)

    plt.show()


plot_income_by_gender(df, title="Original")

## Synthesize Data via MOSTLY AI

The code below trains a Generator using the MOSTLY AI Synthetic Data SDK. We then use that Generator to create two datasets: a representative synthetic dataset, and a fair synthetic dataset generated with the Fairness feature turned on for the target column `income` and the sensitive column `sex`.

In [None]:
from mostlyai.sdk import MostlyAI

# initialize SDK
mostly = MostlyAI()

In [None]:
# train a generator on the original training data
g = mostly.train(data=trn, name="Fairness Tutorial")

In [None]:
# create representative synthetic data that preserves the bias present in the original data
sd = mostly.generate(g, name="Fairness Tutorial - Representative Synthetic Data")
syn = sd.data()

In [None]:
# Define the fairness configuration
fairness_config = {
    "name": "Fairness Tutorial - Fair Synthetic Data",
    "tables": [
        {
            "name": "data",
            "configuration": {
                "fairness": {
                    "target_column": "income",  # define fairness target
                    "sensitive_columns": ["sex"],  # define sensitive columns
                }
            },
        }
    ],
}

# create fair synthetic data with mitigated bias
fair_sd = mostly.generate(g, config=fairness_config)
fair_syn = fair_sd.data()
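If you want to reuse the same configuration shape for other target or sensitive columns, a small helper can build the dictionary. This is only a sketch: `build_fairness_config` is a hypothetical helper, not an SDK function, and it reproduces only the keys shown in `fairness_config` above. Which other options the SDK accepts should be checked against its documentation.

```python
def build_fairness_config(name, target_column, sensitive_columns, table_name="data"):
    """Build a generation config dict shaped like `fairness_config` above."""
    return {
        "name": name,
        "tables": [
            {
                "name": table_name,
                "configuration": {
                    "fairness": {
                        "target_column": target_column,
                        "sensitive_columns": list(sensitive_columns),
                    }
                },
            }
        ],
    }


config = build_fairness_config(
    name="Fairness Tutorial - Fair Synthetic Data",
    target_column="income",
    sensitive_columns=["sex"],
)
print(config["tables"][0]["configuration"]["fairness"])
```

The resulting dictionary is identical to the hand-written one, so it could be passed to `mostly.generate(g, config=config)` in the same way.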

You can now examine the distributions using the Model QA and Data QA reports. These reports can be downloaded via `sd.reports()` for the synthetic data and `fair_sd.reports()` for the fair synthetic data. The Model QA report evaluates the accuracy and privacy performance of the trained generative AI model, demonstrating that the distributions are faithfully learned, including the original proportions of high-income men and women. The Data QA report visualizes how the income distributions in the delivered fair synthetic dataset have been adjusted to mitigate the statistical parity difference.

In [None]:
print(sd.reports("representative-synthetic-data-reports.zip").absolute())
print(fair_sd.reports("fair-synthetic-data-reports.zip").absolute())

In [None]:
plot_income_by_gender(syn, "Representative Synthetic Data")

In [None]:
plot_income_by_gender(fair_syn, "Fair Synthetic Data")

The statistical parity difference is mitigated in the fair synthetic dataset, i.e., the proportions of females and males with high income are comparable.

## Train a Downstream ML Model

We can now compare the predictions of a downstream model trained on the original data, the synthetic data, and the fair synthetic data, each evaluated on the same holdout set.

In [None]:
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score, f1_score, accuracy_score

In [None]:
trn_fns = ["original", "synthetic", "fair synthetic"]

# separate target and features without mutating the holdout DataFrame
y_hol = (hol["income"] == ">50K").astype(int)
X_hol = hol.drop(columns="income")

cat_cols = X_hol.columns[X_hol.dtypes == "object"].tolist()
num_cols = X_hol.columns[X_hol.dtypes == "int64"].tolist()
ct = ColumnTransformer([("c", OneHotEncoder(handle_unknown="ignore"), cat_cols), ("n", MinMaxScaler(), num_cols)])
model = LGBMClassifier(n_estimators=100)
pipe = Pipeline(steps=[("t", ct), ("m", model)])

res = []
predicted_probs = pd.DataFrame()
# use a distinct loop variable so that the original `trn` is not overwritten
for trn_fn, trn_df in zip(trn_fns, [trn, syn, fair_syn]):
    y_trn = (trn_df["income"] == ">50K").astype(int)
    X_trn = trn_df.drop(columns="income")
    pipe.fit(X_trn, y_trn)
    # predicted probability of the high-income class on the real holdout data
    probs = pipe.predict_proba(X_hol)[:, 1]
    predicted_probs[trn_fn] = probs
    res.append(
        {
            "AUC": roc_auc_score(y_hol, probs),
            "Accuracy": accuracy_score(y_hol, probs > 0.5),
            "F1": f1_score(y_hol, probs > 0.5, average="macro"),
            "N": trn_df.shape[0],
            "fn": trn_fn,
        }
    )

predicted_probs["sex"] = hol["sex"]
predicted_probs["income"] = y_hol

In [None]:
# collect the results and sort them by the statistical parity (SP) mean difference
res_sort = pd.DataFrame(res)
# difference in mean predicted probability between the two `sex` groups
# (second group minus first group, in the sorted order used by groupby)
res_sort["SP mean difference"] = res_sort["fn"].map(
    predicted_probs.groupby(["sex"])[trn_fns].mean().diff().iloc[1, :]
)
res_sort = res_sort.sort_values(by="SP mean difference", ascending=True)
res_sort

The model performance on synthetic data is comparable to that on the original data, with a similar statistical parity (SP) difference. While fair synthetic data successfully resolves the SP difference, it does so at the expense of downstream model performance, reflected in a decreased AUC.
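The "SP mean difference" column comes from a compact pandas chain, `groupby(...).mean().diff().iloc[1, :]`, which can be hard to read at first glance. The sketch below replays that chain on a tiny, made-up table (the probabilities are illustrative, not actual model output) to show what each step produces.

```python
import pandas as pd

# toy predictions (made up for illustration); one column per training set
toy_probs = pd.DataFrame(
    {
        "sex": ["Female", "Male", "Female", "Male"],
        "original": [0.125, 0.5, 0.375, 0.75],
        "fair synthetic": [0.25, 0.25, 0.5, 0.5],
    }
)

# mean predicted probability per group; groupby sorts its keys, so rows are Female, Male
group_means = toy_probs.groupby("sex")[["original", "fair synthetic"]].mean()

# diff() subtracts the previous row, so row 1 holds "Male minus Female"
sp_mean_difference = group_means.diff().iloc[1, :]
print(sp_mean_difference)
```

A positive value therefore means that the group sorted last (here `Male`) receives a higher mean predicted probability, and a value near zero means the two group means are aligned.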

In [None]:
import plotly.express as px

In [None]:
px.histogram(predicted_probs, x="original", color="sex", marginal="box", title="Prediction_original")

In [None]:
px.histogram(predicted_probs, x="synthetic", color="sex", marginal="box", title="Prediction_synthetic")

In [None]:
px.histogram(predicted_probs, x="fair synthetic", color="sex", marginal="box", title="Prediction_fair_synthetic")

To evaluate the downstream model's predictions, we analyze the distribution of the predicted probabilities. To get fair predictions at any chosen classification threshold, it is crucial that the prediction distributions for males and females are comparable. This is best assessed using the marginal box plots. For the model trained on the original data (and consequently for the one trained on the representative synthetic data), the probability distribution for females is shifted to the left, indicating that the model predicts high income with lower probability for females than for males. However, when the predictor is trained on the fair synthetic data, the distributions for males and females become more aligned, indicating improved fairness in predictions.
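The point about "any chosen classification threshold" can also be checked numerically by sweeping thresholds and comparing the share of positive predictions per group. The following is a minimal sketch with made-up probabilities. `positive_rate_gap` is a hypothetical helper introduced for illustration, and it assumes both groups are non-empty.

```python
def positive_rate_gap(probs, groups, group_a, group_b, thresholds):
    """For each threshold, return the share of group_a predicted positive minus that of group_b."""
    gaps = {}
    for t in thresholds:
        rates = {}
        for g in (group_a, group_b):
            member_probs = [p for p, grp in zip(probs, groups) if grp == g]
            rates[g] = sum(p > t for p in member_probs) / len(member_probs)
        gaps[t] = rates[group_a] - rates[group_b]
    return gaps


# toy predicted probabilities (made up for illustration)
groups = ["Male"] * 4 + ["Female"] * 4
biased = [0.2, 0.4, 0.6, 0.8, 0.1, 0.2, 0.3, 0.6]  # female scores shifted to the left
aligned = [0.2, 0.4, 0.6, 0.8, 0.2, 0.4, 0.6, 0.8]  # identical distributions

thresholds = [0.25, 0.5, 0.75]
biased_gaps = positive_rate_gap(biased, groups, "Male", "Female", thresholds)
aligned_gaps = positive_rate_gap(aligned, groups, "Male", "Female", thresholds)
print(biased_gaps)
print(aligned_gaps)
```

With the shifted distribution the gap is non-zero at every threshold, whereas with aligned distributions it vanishes at all of them. On the tutorial's results, the helper could be applied to `predicted_probs["fair synthetic"]` and `predicted_probs["sex"]`.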

## Conclusion

As we can see, the fair synthetic data mitigates the sex bias present in the original data. Moreover, a downstream model trained on the fair synthetic data produces predictions that are fair with respect to statistical parity, even when inferring from real-world, biased data.

## Further Reading

* For a demo within the MOSTLY AI platform, please see https://www.youtube.com/watch?v=Uxq_1t2_NCk
* For theoretical background and further analysis, see https://arxiv.org/abs/2311.03000