# Differentially Private Synthetic Data

In this notebook, we demonstrate how to train a generator with differential privacy (DP) guarantees, and explore how the DP settings affect data fidelity.

For further background and analysis see also [this blog post](https://mostly.ai/blog/differentially-private-synthetic-data-with-mostly-ai) on "_Differentially Private Synthetic Data with MOSTLY AI_".

In [None]:
%pip install -U mostlyai # or: pip install -U 'mostlyai[local]'

## Load Original Data

In [None]:
import pandas as pd

# fetch original data
df_original = pd.read_csv("https://github.com/mostly-ai/public-demo-data/raw/dev/census/census.csv.gz")
df_original.head(5)

## Train Generators with and without Differential Privacy

In [None]:
from mostlyai.sdk import MostlyAI

# initialize SDK
mostly = MostlyAI()

Train a generator without DP until fully converged.

In [None]:
g_no_dp = mostly.train(
    config={
        "name": "US Census without DP - full",
        "tables": [
            {
                "name": "census",
                "data": df_original,
            }
        ],
    },
)

Train a generator without DP, but limited to 5 epochs.

In [None]:
g_no_dp_e5 = mostly.train(
    config={
        "name": "US Census without DP - 5 epochs",
        "tables": [
            {
                "name": "census",
                "data": df_original,
                "tabular_model_configuration": {
                    "max_epochs": 5,  # limit training to 5 epochs
                },
            }
        ],
    },
)

Train a generator with DP, keeping all defaults.

In [None]:
g_dp_A = mostly.train(
    config={
        "name": "Census with DP - 1.5 1",
        "tables": [
            {
                "name": "census",
                "data": df_original,
                "tabular_model_configuration": {
                    "differential_privacy": {
                        # Maximum allowable epsilon; if training exceeds this threshold, it is terminated early.
                        "max_epsilon": None,
                        # Delta value for differential privacy: the probability of the privacy guarantee not holding.
                        "delta": 1e-5,
                        # Ratio of the standard deviation of the Gaussian noise to the L2-sensitivity
                        # of the function to which the noise is added (i.e. how much noise to add).
                        "noise_multiplier": 1.5,
                        # Maximum norm of the per-sample gradients when training with differential privacy.
                        "max_grad_norm": 1.0,
                    },
                },
            }
        ],
    },
)
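
To make the roles of `noise_multiplier` and `max_grad_norm` more concrete, here is a minimal pure-Python sketch of the clip-then-add-noise step used in standard DP-SGD. It is an illustration only, not the SDK's implementation: the function names `clip_gradient` and `dp_noisy_mean` are hypothetical, and it assumes that noise with standard deviation `noise_multiplier * max_grad_norm` is added to the sum of the clipped per-sample gradients.

```python
import math
import random


def clip_gradient(grad, max_grad_norm):
    """Scale a per-sample gradient so its L2 norm is at most max_grad_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_grad_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_noisy_mean(per_sample_grads, max_grad_norm, noise_multiplier, seed=0):
    """Clip each per-sample gradient, sum, add Gaussian noise, then average."""
    rng = random.Random(seed)
    clipped = [clip_gradient(g, max_grad_norm) for g in per_sample_grads]
    sigma = noise_multiplier * max_grad_norm  # noise std (assumed DP-SGD convention)
    dim = len(per_sample_grads[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_sample_grads) for v in noisy]


# Two toy per-sample gradients with L2 norms 5.0 and 0.5 (made-up values)
grads = [[3.0, 4.0], [0.3, 0.4]]
print([round(x, 3) for x in clip_gradient(grads[0], 1.0)])  # -> [0.6, 0.8]
print(dp_noisy_mean(grads, max_grad_norm=1.0, noise_multiplier=1.5))
```

The first gradient is scaled down to norm 1.0, while the second is already below the threshold and passes through unchanged. Clipping bounds how much any single record can influence an update, and the noise then masks that bounded influence.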

Train a generator with DP, using stricter settings that add more noise.

In [None]:
g_dp_B = mostly.train(
    config={
        "name": "Census with DP - 4 2",
        "tables": [
            {
                "name": "census",
                "data": df_original,
                "tabular_model_configuration": {
                    "differential_privacy": {
                        "max_epsilon": None,
                        "delta": 1e-5,
                        "noise_multiplier": 4.0,  # increased compared to default
                        "max_grad_norm": 2.0,  # increased compared to default
                    },
                },
            }
        ],
    },
)
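
As a rough way to compare the two DP runs, the sketch below computes the absolute noise scale implied by each pair of settings. It assumes the standard DP-SGD convention that the Gaussian noise added to the summed clipped gradients has standard deviation `noise_multiplier * max_grad_norm`; whether the SDK applies the noise in exactly this way is an assumption. The helper `noise_std` is hypothetical.

```python
# DP settings of the two runs trained in this notebook
dp_settings = {
    "g_dp_A": {"noise_multiplier": 1.5, "max_grad_norm": 1.0},
    "g_dp_B": {"noise_multiplier": 4.0, "max_grad_norm": 2.0},
}


def noise_std(noise_multiplier, max_grad_norm):
    """Std of the Gaussian noise on the summed clipped gradients (assumed DP-SGD convention)."""
    return noise_multiplier * max_grad_norm


for name, settings in dp_settings.items():
    print(f"{name}: noise std = {noise_std(**settings):.1f}")
# g_dp_A: noise std = 1.5
# g_dp_B: noise std = 8.0
```

In standard DP-SGD accounting, the privacy budget is driven by the noise multiplier (together with the sampling rate, the number of steps, and delta), not by the clipping norm itself. The clipping norm mainly determines how much of each per-sample gradient survives before noise is added.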

## Compare Metrics across these Runs

In [None]:
generators = [g_no_dp, g_no_dp_e5, g_dp_A, g_dp_B]
for g in generators:
    # fetch final epsilon from message of last model checkpoint
    messages = pd.DataFrame(g.training.progress().steps[3].messages)
    final_msg = messages.loc[messages.is_checkpoint == 1, :].tail(1).to_dict("records")[0]
    final_time = final_msg.get("total_time")
    final_eps = final_msg.get("dp_eps") or "-"
    final_delta = final_msg.get("dp_delta") or "-"
    # print out stats
    print(
        f"# {g.name}\nAccuracy: {g.accuracy:.1%}\nRuntime: {final_time:.0f} secs\nDP Epsilon: {final_eps}\nDP Delta: {final_delta}\n"
    )
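
The checkpoint lookup above can also be written without pandas, which makes it easy to handle a run that has no checkpoint message yet. The following is a small sketch: the helper `last_checkpoint` is hypothetical, and the example messages are made-up placeholders that only reuse the field names from the cell above (`is_checkpoint`, `total_time`, `dp_eps`, `dp_delta`). They are not real training output.

```python
def last_checkpoint(messages):
    """Return the last message flagged as a checkpoint, or None if there is none."""
    checkpoints = [m for m in messages if m.get("is_checkpoint") == 1]
    return checkpoints[-1] if checkpoints else None


# Made-up placeholder messages, for illustration only
example_messages = [
    {"is_checkpoint": 0, "total_time": 10.0},
    {"is_checkpoint": 1, "total_time": 20.0, "dp_eps": 1.2, "dp_delta": 1e-5},
    {"is_checkpoint": 1, "total_time": 30.0, "dp_eps": 2.5, "dp_delta": 1e-5},
    {"is_checkpoint": 0, "total_time": 35.0},
]

final_msg = last_checkpoint(example_messages)
if final_msg is None:
    print("no checkpoint found")
else:
    # runs without DP carry no epsilon/delta, so fall back to "-"
    print(final_msg.get("dp_eps") or "-", final_msg.get("dp_delta") or "-")
```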

## Further exercises

In addition to walking through the above instructions, we suggest:
* experimenting with different DP settings
* studying the impact of the total size of the training data on the final epsilon
* evaluating the accuracy-privacy trade-off on other datasets
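
One possible starting point for these exercises is to generate the training configs programmatically. The sketch below builds config dictionaries with the same keys as the DP examples in this notebook. The helper `build_dp_config` and the grid values are hypothetical, and `data=None` is only a placeholder so that the snippet runs standalone.

```python
from itertools import product


def build_dp_config(data, noise_multiplier, max_grad_norm, delta=1e-5, max_epsilon=None):
    """Build a training config mirroring the DP examples in this notebook."""
    return {
        "name": f"Census with DP - {noise_multiplier} {max_grad_norm}",
        "tables": [
            {
                "name": "census",
                "data": data,
                "tabular_model_configuration": {
                    "differential_privacy": {
                        "max_epsilon": max_epsilon,
                        "delta": delta,
                        "noise_multiplier": noise_multiplier,
                        "max_grad_norm": max_grad_norm,
                    },
                },
            }
        ],
    }


# Hypothetical grid of settings to explore
noise_multipliers = [1.0, 1.5, 4.0]
max_grad_norms = [1.0, 2.0]

configs = [
    build_dp_config(None, nm, gn)  # pass df_original instead of None in the notebook
    for nm, gn in product(noise_multipliers, max_grad_norms)
]
print(len(configs))  # -> 6
```

In the notebook, each config could then be passed to `mostly.train(config=...)` as in the cells above. To study the effect of training data size, a subsample such as `df_original.sample(frac=0.5, random_state=42)` could be passed as `data`.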

## Conclusion

This tutorial demonstrated how to train generators with and without differential privacy guarantees. Note that DP provides additional mathematical guarantees for use cases that require them. However, given the other privacy mechanisms built into the SDK, synthetic data can be considered anonymous even without the stricter DP guarantees. See again [the blog post](https://mostly.ai/blog/differentially-private-synthetic-data-with-mostly-ai) for further discussion.