{ "cells": [ { "cell_type": "markdown", "id": "3fd5150a-d33e-435b-ac47-3fdf15fb05c2", "metadata": {}, "source": [ "# Close Gaps in Your Data with Smart Imputation\n", "\n", "Dealing with datasets that contain missing values can be a challenge. This is especially so if the remaining non-missing values are not representative and thus provide a distorted, biased picture of the overall population.\n", "\n", "In this tutorial we demonstrate how MOSTLY AI can help to close such gaps in your data via \"Smart Imputation\". By generating a synthetic dataset that does not contain any missing values, it is possible to create a complete and sound representation of the underlying population. With this smartly imputed synthetic dataset it is then straightforward to accurately analyze the population as if all values were present in the first place.\n", "\n", "For this tutorial, we will be using a modified version of the UCI Adult Income dataset, which itself stems from the 1994 Census database of the US Census Bureau. This reduced dataset consists of 48,842 records and 10 mixed-type features. We will replace ~30% of the values for attribute `age` with missing values. We will do this randomly, but with a specified bias, so that we end up missing the age information particularly from the older segments." ] }, { "cell_type": "markdown", "id": "27fadb9e-433c-43bf-b24a-e8cf1d0f5843", "metadata": {}, "source": [ "## Data Preparation for this Tutorial\n", "\n", "We start by artificially injecting missing values into the original data via the following code." 
] }, { "cell_type": "code", "execution_count": null, "id": "cb5536d4", "metadata": {}, "outputs": [], "source": [ "%pip install -U mostlyai # or: pip install -U 'mostlyai[local]'" ] }, { "cell_type": "code", "execution_count": null, "id": "64397119-e477-44a8-8c7a-500e09560aea", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "df = pd.read_csv(\"https://github.com/mostly-ai/public-demo-data/raw/dev/census/census.csv.gz\")\n", "\n", "\n", "def mask(prob, col=None, values=None):\n", " is_masked = np.random.uniform(size=df.shape[0]) < prob\n", " if col:\n", " is_masked = (is_masked) & (df[col].isin(values))\n", " df[\"age\"] = df[\"age\"].mask(is_masked)\n", "\n", "\n", "np.random.seed(123)\n", "mask(0.1, \"age\", [51 + i for i in range(20)])\n", "mask(0.2, \"age\", [71 + i for i in range(20)])\n", "mask(0.6, \"income\", [\">50K\"])\n", "mask(0.6, \"education\", [\"Doctorate\", \"Prof-school\", \"Masters\"])\n", "mask(0.6, \"marital_status\", [\"Widowed\", \"Divorced\"])\n", "mask(0.6, \"occupation\", [\"Exec-managerial\"])\n", "mask(0.6, \"workclass\", [\"Self-emp-inc\"])\n", "tgt = df\n", "print(f\"Created original data with missing values with {tgt.shape[0]:,} records and {tgt.shape[1]} attributes\")" ] }, { "cell_type": "code", "execution_count": null, "id": "1a650a59-5fff-40b2-a80c-1916f89937c8", "metadata": {}, "outputs": [], "source": [ "# let's show some samples\n", "tgt[[\"workclass\", \"education\", \"marital_status\", \"age\"]].sample(n=10, random_state=42)" ] }, { "cell_type": "code", "execution_count": null, "id": "993f679f-c596-47d7-9efc-441d668c1cfc", "metadata": {}, "outputs": [], "source": [ "# report share of missing values for column `age`\n", "print(f\"{tgt['age'].isna().mean():.1%} of values for column `age` are missing\")" ] }, { "cell_type": "code", "execution_count": null, "id": "cf86c704-50ec-434b-990c-87c2dfc706e9", "metadata": {}, "outputs": [], "source": [ "# plot distribution of column 
`age`\n", "import matplotlib.pyplot as plt\n", "\n", "tgt.age.plot(kind=\"kde\", label=\"Original Data (with missings)\", color=\"black\")\n", "_ = plt.legend(loc=\"upper right\")\n", "_ = plt.title(\"Age Distribution\")\n", "_ = plt.xlim(13, 90)\n", "_ = plt.ylim(0, None)" ] }, { "cell_type": "markdown", "id": "2ee9726e-4eb8-4176-9ca4-aa8df2cb2933", "metadata": {}, "source": [ "## Synthesize Data via MOSTLY AI\n", "\n", "The code below will automatically create a Generator using the MOSTLY AI Synthetic Data SDK. Then we will use that Generator to create a synthetic dataset with Smart Imputation turned on for the `age` column." ] }, { "cell_type": "code", "execution_count": null, "id": "125e6d793c1175cb", "metadata": {}, "outputs": [], "source": [ "from mostlyai.sdk import MostlyAI\n", "\n", "# initialize SDK\n", "mostly = MostlyAI()" ] }, { "cell_type": "code", "execution_count": null, "id": "aa76b44b-746a-4905-98c7-60a5f69bf713", "metadata": {}, "outputs": [], "source": [ "# train a generator on the original training data\n", "g = mostly.train(data=tgt, name=\"Smart Imputation Tutorial - Census\")" ] }, { "cell_type": "code", "execution_count": null, "id": "eb8ce2cb", "metadata": {}, "outputs": [], "source": [ "# generate synthetic data with imputed age column\n", "config = {\n", " \"name\": \"Smart Imputation Tutorial - Census\",\n", " \"tables\": [{\"name\": \"data\", \"configuration\": {\"imputation\": {\"columns\": [\"age\"]}}}],\n", "}\n", "sd = mostly.generate(g, config=config)\n", "syn = sd.data()\n", "print(f\"Created synthetic data with {syn.shape[0]:,} records and {syn.shape[1]:,} attributes\")" ] }, { "cell_type": "markdown", "id": "97058dc7-b791-4674-b028-9599384b9d7f", "metadata": {}, "source": [ "If you want to, you can now check the distribution based on the Model QA and Data QA reports. Download these via `sd.reports()`, or display inline via `sd.reports(display=True)`. 
In the Model QA report, one can see that the distributions are faithfully learned and also include the right share of missing values. The Data QA report then visualizes the distributions of the delivered synthetic dataset. There we can see that the share of missing values (`N/A`) has dropped to 0%, and that the distribution has shifted towards older age buckets." ] }, { "cell_type": "code", "execution_count": null, "id": "4f0091bb", "metadata": {}, "outputs": [], "source": [ "sd.reports(display=True)" ] }, { "cell_type": "markdown", "id": "25c4ac9f-8eef-4706-b033-6ea2d4e1d941", "metadata": {}, "source": [ "## Analyze the results\n", "\n", "We can now explore the imputed synthetic data." ] }, { "cell_type": "code", "execution_count": null, "id": "2a221626-b992-413c-9391-237f083ede10", "metadata": {}, "outputs": [], "source": [ "# show some synthetic samples\n", "syn[[\"workclass\", \"education\", \"marital_status\", \"age\"]].sample(n=10, random_state=42)" ] }, { "cell_type": "code", "execution_count": null, "id": "7e9ea56f-d4b5-4cd7-b217-aa659e058e7c", "metadata": {}, "outputs": [], "source": [ "# report share of missing values for column `age`\n", "print(f\"{syn['age'].isna().mean():.1%} of values for column `age` are missing\")" ] }, { "cell_type": "code", "execution_count": null, "id": "d470dabd-f1d8-4c37-ab24-6109d5ec380c", "metadata": {}, "outputs": [], "source": [ "# plot side-by-side\n", "import matplotlib.pyplot as plt\n", "\n", "tgt.age.plot(kind=\"kde\", label=\"Original Data (with missings)\", color=\"black\")\n", "syn.age.plot(kind=\"kde\", label=\"Synthetic Data (imputed)\", color=\"green\")\n", "_ = plt.title(\"Age Distribution\")\n", "_ = plt.legend(loc=\"upper right\")\n", "_ = plt.xlim(13, 90)\n", "_ = plt.ylim(0, None)" ] }, { "cell_type": "markdown", "id": "a7ef6885-fa86-41ef-a90e-bb2359372e99", "metadata": {}, "source": [ "As one can see, the imputed synthetic data does NOT contain any missing values anymore. 
But it's also apparent that the synthetic age distribution differs markedly from the distribution of the non-missing values that were provided.\n", "\n", "So, let's check whether that new distribution is more representative of the ground truth, i.e. the underlying original age distribution." ] }, { "cell_type": "code", "execution_count": null, "id": "5b6876a7-05b6-4bb6-bca9-d89e4ee61a8c", "metadata": {}, "outputs": [], "source": [ "raw = pd.read_csv(\"https://github.com/mostly-ai/public-demo-data/raw/dev/census/census.csv.gz\")\n", "\n", "# plot side-by-side\n", "import matplotlib.pyplot as plt\n", "\n", "tgt.age.plot(kind=\"kde\", label=\"Original Data (with missings)\", color=\"black\")\n", "raw.age.plot(kind=\"kde\", label=\"Original Data (ground truth)\", color=\"red\")\n", "syn.age.plot(kind=\"kde\", label=\"Synthetic Data (imputed)\", color=\"green\")\n", "_ = plt.title(\"Age Distribution\")\n", "_ = plt.legend(loc=\"upper right\")\n", "_ = plt.xlim(13, 90)\n", "_ = plt.ylim(0, None)" ] }, { "cell_type": "markdown", "id": "8d81298d", "metadata": {}, "source": [ "## Imputing Missing Values in the Original Data" ] }, { "cell_type": "markdown", "id": "6bb46e12", "metadata": {}, "source": [ "If you only need to impute missing values in the original dataset, rather than generating privacy-preserving synthetic data, you can use conditional generation. In this case, the seed dataset consists of the rows where the age value is missing. To impute the missing age values, you can exclude the age column from the seed dataset and then use the imputation option to generate the missing values for that column.\n", "\n", "The MOSTLY AI generator provides a full distribution estimate for the missing values. When you perform imputation, you receive a distribution sample. To obtain a point estimate (i.e., the best estimate for age for a specific record), you can calculate a statistic, such as the median or average, based on the generated values. 
For example, you can generate 100 imputed values to obtain a distribution estimate, and then calculate the median or average of these values to get the point estimate for the age." ] }, { "cell_type": "code", "execution_count": 19, "id": "b0ccd4cb", "metadata": {}, "outputs": [], "source": [ "# create a seed dataset that is used as the condition for the imputation of missing age\n", "seed = tgt.loc[tgt[\"age\"].isna(), :].drop(columns=\"age\")" ] }, { "cell_type": "code", "execution_count": null, "id": "bd818ec8", "metadata": {}, "outputs": [], "source": [ "# probe generator for imputed age column 100 times\n", "import numpy as np\n", "\n", "age_imputed = []\n", "for i in range(100):\n", " if i % 10 == 0:\n", " print(i, \"of\", 100, \"imputations done\")\n", " config = {\n", " \"tables\": [\n", " {\n", " \"name\": \"data\",\n", " \"configuration\": {\n", " \"sample_seed_data\": seed,\n", " \"imputation\": {\"columns\": [\"age\"]},\n", " },\n", " }\n", " ],\n", " }\n", " syn = mostly.probe(g, config=config)\n", " age_imputed.append(syn.age.to_numpy())" ] }, { "cell_type": "code", "execution_count": 21, "id": "bd6d9f79", "metadata": {}, "outputs": [], "source": [ "# take a single imputed value (this is NOT a point estimate, but a distribution sample)\n", "syn_one = tgt.copy()\n", "syn_one.loc[seed.index, \"age\"] = age_imputed[0]\n", "\n", "# take the median imputed value\n", "syn_q50 = tgt.copy()\n", "syn_q50.loc[seed.index, \"age\"] = np.median(np.vstack(age_imputed), axis=0)\n", "\n", "# take the avg imputed value\n", "syn_avg = tgt.copy()\n", "syn_avg.loc[seed.index, \"age\"] = np.mean(np.vstack(age_imputed), axis=0)" ] }, { "cell_type": "code", "execution_count": null, "id": "7c0748b3", "metadata": {}, "outputs": [], "source": [ "tgt.age.plot(kind=\"kde\", label=\"Original Data (with missings)\", color=\"black\")\n", "raw.age.plot(kind=\"kde\", label=\"Original Data (ground truth)\", color=\"red\")\n", "syn_one.age.plot(kind=\"kde\", label=\"Synthetic Data 
(one)\", color=\"blue\")\n", "syn_q50.age.plot(kind=\"kde\", label=\"Synthetic Data (q50)\", color=\"lightgreen\")\n", "syn_avg.age.plot(kind=\"kde\", label=\"Synthetic Data (avg)\", color=\"darkgreen\")\n", "_ = plt.title(\"Age Distribution\")\n", "_ = plt.legend(loc=\"upper right\")\n", "_ = plt.xlim(13, 90)\n", "_ = plt.ylim(0, None)" ] }, { "cell_type": "markdown", "id": "3997dee2", "metadata": {}, "source": [ "Let’s examine the first missing value in the dataset, its actual value, the distribution estimate, and calculate the point estimate." ] }, { "cell_type": "code", "execution_count": null, "id": "39d52e73", "metadata": {}, "outputs": [], "source": [ "idx = tgt[tgt[\"age\"].isna()].index[0]\n", "raw.iloc[idx]" ] }, { "cell_type": "code", "execution_count": null, "id": "7d7fcb6e", "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(10, 6))\n", "\n", "# columns of the stacked imputations follow the row order of `seed`, so look up the position of `idx` within `seed`\n", "pd.Series(np.vstack(age_imputed)[:, seed.index.get_loc(idx)], name=\"age\").plot(kind=\"kde\", label=\"Distribution estimate\", color=\"green\")\n", "\n", "ax.axvline(x=raw.iloc[idx][\"age\"], color=\"black\", linestyle=\"-\", label=\"Actual value\")\n", "ax.axvline(x=syn_q50.iloc[idx][\"age\"], color=\"lightgreen\", linestyle=\"--\", label=\"Point estimate: median\")\n", "ax.axvline(x=syn_avg.iloc[idx][\"age\"], color=\"darkgreen\", linestyle=\"--\", label=\"Point estimate: average\")\n", "ax.axvline(x=syn_one.iloc[idx][\"age\"], color=\"blue\", linestyle=\"-\", label=\"Distribution sample: imputation 1\")\n", "\n", "# Add labels and title\n", "ax.set_xlabel(\"age\")\n", "ax.set_ylabel(\"Density\")\n", "ax.set_title(\"Comparison of the Actual Value and Its Estimates\")\n", "\n", "# Show legend\n", "ax.legend()\n", "\n", "# Show the plot\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "804b7f3a-b107-4dac-8a19-fd2854d1ba88", "metadata": {}, "source": [ "## Conclusion\n", "\n", "As we can see, the smartly imputed synthetic data closely recovers the original, suppressed distribution. 
As an analyst you can proceed with the exploratory and descriptive analysis, as if the values were present in the first place.\n", "\n", "Additionally, you can use the tool as a distribution estimator for the missing values in your original dataset. Simply apply conditional generation with the original seed and impute the column(s) of interest.\n", "\n", "## Further Reading\n", "\n", "See also here for a benchmark of Smart Imputation with respect to other commonly used imputation techniques: https://mostly.ai/blog/smart-imputation-with-synthetic-data" ] }, { "cell_type": "code", "execution_count": null, "id": "991025db-bca7-4604-a5d9-811e4461c8d5", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.8" } }, "nbformat": 4, "nbformat_minor": 5 }