{ "cells": [ { "cell_type": "markdown", "id": "b2381157-7e39-4fd1-8ede-4e81ac9ff2a8", "metadata": {}, "source": [ "# Differentially Private Synthetic Data\n", "\n", "In this notebook, we demonstrate how to train a generator with differential privacy (DP) guarantees, and explore how the DP settings affect data fidelity.\n", "\n", "For further background and analysis, see [this blog post](https://mostly.ai/blog/differentially-private-synthetic-data-with-mostly-ai) on \"_Differentially Private Synthetic Data with MOSTLY AI_\"." ] }, { "cell_type": "code", "execution_count": null, "id": "8497f8f5-0137-4bd6-aef8-dd1683c12bb4", "metadata": {}, "outputs": [], "source": "%pip install -U mostlyai # or: pip install -U 'mostlyai[local]'" }, { "cell_type": "markdown", "id": "e42b8878-e345-406b-8ff2-500a87740906", "metadata": {}, "source": [ "## Load Original Data" ] }, { "cell_type": "code", "execution_count": null, "id": "b131403d-1b60-4f36-a712-a96fe3a526c4", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# fetch original data\n", "df_original = pd.read_csv(\"https://github.com/mostly-ai/public-demo-data/raw/dev/census/census.csv.gz\")\n", "df_original.head(5)" ] }, { "cell_type": "markdown", "id": "5c15f89f-0fa6-417b-8048-589be5f651f3", "metadata": {}, "source": [ "## Train Generators with and without Differential Privacy" ] }, { "cell_type": "code", "execution_count": null, "id": "c8c4a141-d2b5-491d-9406-dc5ae498402c", "metadata": {}, "outputs": [], "source": [ "from mostlyai.sdk import MostlyAI\n", "\n", "# initialize SDK\n", "mostly = MostlyAI()" ] }, { "cell_type": "markdown", "id": "6f40d343-4475-4d59-8bde-e8952f360ef0", "metadata": {}, "source": [ "Train a generator without DP until fully converged." 
] }, { "cell_type": "code", "execution_count": null, "id": "d5b6d500-5bf5-4902-8bc5-828bc6864c3b", "metadata": {}, "outputs": [], "source": [ "g_no_dp = mostly.train(\n", " config={\n", " \"name\": \"US Census without DP - full\",\n", " \"tables\": [\n", " {\n", " \"name\": \"census\",\n", " \"data\": df_original,\n", " }\n", " ],\n", " },\n", ")" ] }, { "cell_type": "markdown", "id": "b6cd29a3-9dfe-4afc-8613-758ca2395886", "metadata": {}, "source": [ "Train a generator without DP, but limited to 5 epochs." ] }, { "cell_type": "code", "execution_count": null, "id": "004c60a8-a6cc-4bf6-b4e6-fe478591b078", "metadata": {}, "outputs": [], "source": [ "g_no_dp_e5 = mostly.train(\n", " config={\n", " \"name\": \"US Census without DP - 5 epochs\",\n", " \"tables\": [\n", " {\n", " \"name\": \"census\",\n", " \"data\": df_original,\n", " \"tabular_model_configuration\": {\n", " \"max_epochs\": 5, # Limit training to 5 epochs.\n", " },\n", " }\n", " ],\n", " },\n", ")" ] }, { "cell_type": "markdown", "id": "2e35b1f8-7abe-472b-a83a-1b05ac785225", "metadata": {}, "source": [ "Train a generator with DP, keeping all defaults." ] }, { "cell_type": "code", "execution_count": null, "id": "19ae87e6-7431-488c-8aed-b612fe6b88b1", "metadata": {}, "outputs": [], "source": [ "g_dp_A = mostly.train(\n", " config={\n", " \"name\": \"Census with DP - 1.5 1\",\n", " \"tables\": [\n", " {\n", " \"name\": \"census\",\n", " \"data\": df_original,\n", " \"tabular_model_configuration\": {\n", " \"differential_privacy\": {\n", " \"max_epsilon\": None, # Specifies the maximum allowable epsilon value. If the training process exceeds this threshold, it will be terminated early.\n", " \"delta\": 1e-5, # The delta value for differential privacy. 
It is the probability of the privacy guarantee not holding.\n", " \"noise_multiplier\": 1.5, # The ratio of the standard deviation of the Gaussian noise to the L2-sensitivity of the function to which the noise is added (i.e., how much noise to add).\n", " \"max_grad_norm\": 1.0, # The maximum norm of the per-sample gradients for training the model with differential privacy.\n", " },\n", " },\n", " }\n", " ],\n", " },\n", ")" ] }, { "cell_type": "markdown", "id": "b5d3c9cf-e712-443e-ad4d-62ad4c66aa54", "metadata": {}, "source": [ "Train a generator with DP, using stricter settings." ] }, { "cell_type": "code", "execution_count": null, "id": "1b21af83-222d-405e-badc-2cbf27f838d9", "metadata": {}, "outputs": [], "source": [ "g_dp_B = mostly.train(\n", " config={\n", " \"name\": \"Census with DP - 4 2\",\n", " \"tables\": [\n", " {\n", " \"name\": \"census\",\n", " \"data\": df_original,\n", " \"tabular_model_configuration\": {\n", " \"differential_privacy\": {\n", " \"max_epsilon\": None,\n", " \"delta\": 1e-5,\n", " \"noise_multiplier\": 4.0, # increased compared to default\n", " \"max_grad_norm\": 2.0, # increased compared to default\n", " },\n", " },\n", " }\n", " ],\n", " },\n", ")" ] }, { "cell_type": "markdown", "id": "2bf1034b-956c-4e20-8a46-6e1e6b22b0f5", "metadata": {}, "source": [ "## Compare Metrics across these Runs" ] }, { "cell_type": "code", "execution_count": null, "id": "3bbaf929-0149-411e-b857-1de68eead7c4", "metadata": {}, "outputs": [], "source": [ "generators = [g_no_dp, g_no_dp_e5, g_dp_A, g_dp_B]\n", "for g in generators:\n", " # fetch final epsilon from message of last model checkpoint\n", " messages = pd.DataFrame(g.training.progress().steps[3].messages)\n", " final_msg = messages.loc[messages.is_checkpoint == 1, :].tail(1).to_dict(\"records\")[0]\n", " final_time = final_msg.get(\"total_time\")\n", " final_eps = final_msg.get(\"dp_eps\") or \"-\"\n", " final_delta = final_msg.get(\"dp_delta\") or \"-\"\n", " # print out stats\n", " print(\n", 
" f\"# {g.name}\\nAccuracy: {g.accuracy:.1%}\\nRuntime: {final_time:.0f} secs\\nDP Epsilon: {final_eps}\\nDP Delta: {final_delta}\\n\"\n", " )" ] }, { "cell_type": "markdown", "id": "16e5b61e-6706-4787-b2d4-70a78f9c83ad", "metadata": {}, "source": [ "## Further Exercises\n", "\n", "In addition to walking through the above instructions, we suggest:\n", "* experimenting with different DP settings\n", "* studying the impact of the total size of the training data on the final epsilon\n", "* evaluating the accuracy-privacy trade-off on other datasets" ] }, { "cell_type": "markdown", "id": "b876316b-62c6-4ddf-b84c-dd8cba517902", "metadata": {}, "source": [ "## Conclusion\n", "\n", "This tutorial demonstrated how to train generators with and without differential privacy guarantees. Note that DP provides additional mathematical guarantees for use cases that require them. However, given the other privacy mechanisms built into the SDK, synthetic data can be considered anonymous even without the stricter DP guarantees. See the [blog post](https://mostly.ai/blog/differentially-private-synthetic-data-with-mostly-ai) for further discussion." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.8" } }, "nbformat": 4, "nbformat_minor": 5 }