{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "b2381157-7e39-4fd1-8ede-4e81ac9ff2a8",
   "metadata": {},
   "source": [
    "# Differentially Private Synthetic Data  <a href=\"https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/differential-privacy/differential-privacy.ipynb\" target=\"_blank\"><img src=\"https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab\" alt=\"Run on Colab\"></a>\n",
    "\n",
    "In this notebook, we demonstrate how a generator can be trained with differential privacy guarantees, and explore how the various settings can impact the data fidelity.\n",
    "\n",
    "For further background and analysis see also [this blog post](https://mostly.ai/blog/differentially-private-synthetic-data-with-mostly-ai) on \"_Differentially Private Synthetic Data with MOSTLY AI_\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8497f8f5-0137-4bd6-aef8-dd1683c12bb4",
   "metadata": {},
   "outputs": [],
   "source": "%pip install -U mostlyai  # or: pip install -U 'mostlyai[local]'"
  },
  {
   "cell_type": "markdown",
   "id": "e42b8878-e345-406b-8ff2-500a87740906",
   "metadata": {},
   "source": [
    "## Load Original Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b131403d-1b60-4f36-a712-a96fe3a526c4",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# fetch original data\n",
    "df_original = pd.read_csv(\"https://github.com/mostly-ai/public-demo-data/raw/dev/census/census.csv.gz\")\n",
    "df_original.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5c15f89f-0fa6-417b-8048-589be5f651f3",
   "metadata": {},
   "source": [
    "## Train Generators with and without Differential Privacy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c8c4a141-d2b5-491d-9406-dc5ae498402c",
   "metadata": {},
   "outputs": [],
   "source": [
    "from mostlyai.sdk import MostlyAI\n",
    "\n",
    "# initialize SDK\n",
    "mostly = MostlyAI()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f40d343-4475-4d59-8bde-e8952f360ef0",
   "metadata": {},
   "source": [
    "Train a generator without DP until fully converged."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d5b6d500-5bf5-4902-8bc5-828bc6864c3b",
   "metadata": {},
   "outputs": [],
   "source": [
    "g_no_dp = mostly.train(\n",
    "    config={\n",
    "        \"name\": \"US Census without DP - full\",\n",
    "        \"tables\": [\n",
    "            {\n",
    "                \"name\": \"census\",\n",
    "                \"data\": df_original,\n",
    "            }\n",
    "        ],\n",
    "    },\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b6cd29a3-9dfe-4afc-8613-758ca2395886",
   "metadata": {},
   "source": [
    "Train a generator without DP, but limited to 5 epochs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "004c60a8-a6cc-4bf6-b4e6-fe478591b078",
   "metadata": {},
   "outputs": [],
   "source": [
    "g_no_dp_e5 = mostly.train(\n",
    "    config={\n",
    "        \"name\": \"US Census without DP - 5 epochs\",\n",
    "        \"tables\": [\n",
    "            {\n",
    "                \"name\": \"census\",\n",
    "                \"data\": df_original,\n",
    "                \"tabular_model_configuration\": {\n",
    "                    \"max_epochs\": 5,  # Limit training to 5 epochs.\n",
    "                },\n",
    "            }\n",
    "        ],\n",
    "    },\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2e35b1f8-7abe-472b-a83a-1b05ac785225",
   "metadata": {},
   "source": [
    "Train a generator with DP, keeping all defaults."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "19ae87e6-7431-488c-8aed-b612fe6b88b1",
   "metadata": {},
   "outputs": [],
   "source": [
    "g_dp_A = mostly.train(\n",
    "    config={\n",
    "        \"name\": \"Census with DP - 1.5 1\",\n",
    "        \"tables\": [\n",
    "            {\n",
    "                \"name\": \"census\",\n",
    "                \"data\": df_original,\n",
    "                \"tabular_model_configuration\": {\n",
    "                    \"differential_privacy\": {\n",
    "                        \"max_epsilon\": None,  # Specifies the maximum allowable epsilon value. If the training process exceeds this threshold, it will be terminated early.\n",
    "                        \"delta\": 1e-5,  # The delta value for differential privacy. It is the probability of the privacy guarantee not holding.\n",
    "                        \"noise_multiplier\": 1.5,  # The ratio of the standard deviation of the Gaussian noise to the L2-sensitivity of the function to which the noise is added (How much noise to add).\n",
    "                        \"max_grad_norm\": 1.0,  # The maximum norm of the per-sample gradients for training the model with differential privacy.\n",
    "                    },\n",
    "                },\n",
    "            }\n",
    "        ],\n",
    "    },\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b5d3c9cf-e712-443e-ad4d-62ad4c66aa54",
   "metadata": {},
   "source": [
    "Train a generator with DP, using stricter configurations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1b21af83-222d-405e-badc-2cbf27f838d9",
   "metadata": {},
   "outputs": [],
   "source": [
    "g_dp_B = mostly.train(\n",
    "    config={\n",
    "        \"name\": \"Census with DP - 4 2\",\n",
    "        \"tables\": [\n",
    "            {\n",
    "                \"name\": \"census\",\n",
    "                \"data\": df_original,\n",
    "                \"tabular_model_configuration\": {\n",
    "                    \"differential_privacy\": {\n",
    "                        \"max_epsilon\": None,\n",
    "                        \"delta\": 1e-5,\n",
    "                        \"noise_multiplier\": 4.0,  # increased compared to default\n",
    "                        \"max_grad_norm\": 2.0,  # increased compared to default\n",
    "                    },\n",
    "                },\n",
    "            }\n",
    "        ],\n",
    "    },\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2bf1034b-956c-4e20-8a46-6e1e6b22b0f5",
   "metadata": {},
   "source": [
    "## Compare Metrics across these Runs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3bbaf929-0149-411e-b857-1de68eead7c4",
   "metadata": {},
   "outputs": [],
   "source": [
    "generators = [g_no_dp, g_no_dp_e5, g_dp_A, g_dp_B]\n",
    "for g in generators:\n",
    "    # fetch final epsilon from message of last model checkpoint\n",
    "    messages = pd.DataFrame(g.training.progress().steps[3].messages)\n",
    "    final_msg = messages.loc[messages.is_checkpoint == 1, :].tail(1).to_dict(\"records\")[0]\n",
    "    final_time = final_msg.get(\"total_time\")\n",
    "    final_eps = final_msg.get(\"dp_eps\") or \"-\"\n",
    "    final_delta = final_msg.get(\"dp_delta\") or \"-\"\n",
    "    # print out stats\n",
    "    print(\n",
    "        f\"# {g.name}\\nAccuracy:   {g.accuracy:.1%}\\nRuntime:    {final_time:.0f} secs\\nDP Epsilon: {final_eps}\\nDP Delta:   {final_delta}\\n\"\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16e5b61e-6706-4787-b2d4-70a78f9c83ad",
   "metadata": {},
   "source": [
    "## Further exercises\n",
    "\n",
    "In addition to walking through the above instructions, we suggest..\n",
    "* to experiment with different DP settings\n",
    "* to study the impact of the total size of the training data on final eps\n",
    "* to evaluate the accuracy-privacy trade off also for other datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b876316b-62c6-4ddf-b84c-dd8cba517902",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "This tutorial demonstrated how to train with and without differential privacy guarantees. Note: DP just provides additional mathematical guarantees for use cases that require these. However, given the other privacy mechanism in-built into the SDK, synthetic data can also without stricter DP guarantees be considered to be anonymous. See again [here](https://mostly.ai/blog/differentially-private-synthetic-data-with-mostly-ai) for a further discussion."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}