# Conditional Generation <a href="https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/conditional-generation/conditional-generation.ipynb" target="_blank"><img src="https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab" alt="Run on Colab"></a>

In this tutorial, we show how to generate samples that are conditioned on specific values for a set of attributes. This effectively creates partially synthetic data, where the synthetic attributes are randomly sampled given the context of a handful of pre-determined, fixed attributes. Note that the synthetic data is still statistically representative, but only within the given context. The privacy of the overall dataset then largely depends on the privacy of the provided fixed attributes.

We will demonstrate conditional generation for two use cases:
1. First, we generate synthetic data for the UCI Adult Income dataset, but probe the model for an equal gender split and an income attribute that is uncorrelated with gender. That is, we remove the gender income gap and see how the other attributes change accordingly.
1. Second, we create partially synthetic data for AirBnB listings in Manhattan, where the locations are actual locations while all other attributes are synthetic.

To perform either scenario, we create a Seed table that contains all the columns that we want to hold fixed. Once a Generator has been trained, we provide the Seed context with the fixed attributes to conditionally create a Synthetic Dataset.

Note that the same kind of conditional generation can also be performed for two-table setups. Once a two-table Generator is trained, one can simply provide a Seed context for the subject table. The non-fixed columns of the subject table and the entire linked table are then conditionally generated.

## Use Case 1 - Rebalanced UCI Adult Income

For this use case, we again use the UCI Adult Income [[1](#refs)] dataset. We condition the synthetic data generation on the `sex` and `income` columns.

### Create a Generator with MOSTLY AI

The code below will create a Generator using the MOSTLY AI Synthetic Data SDK.

In [None]:
%pip install -U mostlyai  # or: pip install -U 'mostlyai[local]'

In [None]:
import pandas as pd

# fetch original data
df = pd.read_csv("https://github.com/mostly-ai/public-demo-data/raw/dev/census/census.csv.gz")
df

In [None]:
from mostlyai.sdk import MostlyAI

# initialize SDK
mostly = MostlyAI()

In [None]:
# train a generator on the original training data
g = mostly.train(
    config={
        "name": "Conditional Generation Tutorial Census",
        "tables": [{"name": "data", "data": df, "tabular_model_configuration": {"max_training_time": 1}}],
    }
)

### Generate Synthetic Dataset with MOSTLY AI

Let's create a dataframe with specific values for the fixed attributes `sex` and `income`, and use that as a seed for generating a Synthetic Dataset. We create a 50/50 split between `Male` and `Female`. We keep the overall share of low- and high-income earners constant, but assign income independently of gender, which effectively removes the gender income gap.

In [None]:
import numpy as np

np.random.seed(1)

n = 48_842
p_inc = (df.income == ">50K").mean()
seed = pd.DataFrame(
    {
        "sex": np.random.choice(["Male", "Female"], n, p=[0.5, 0.5]),
        "income": np.random.choice(["<=50K", ">50K"], n, p=[1 - p_inc, p_inc]),
    }
)
display(seed)
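
Before probing, it can be worth confirming that a seed built this way has the intended properties: a roughly even gender split, and a share of high-income earners that is about the same for both genders. The following is a minimal, self-contained sketch of such a check. It rebuilds a comparable seed rather than reusing `seed`, and `p_inc = 0.24` is a placeholder assumption, not the share computed from `df`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

n = 48_842
p_inc = 0.24  # placeholder assumption; in the tutorial this is computed from df

seed_check = pd.DataFrame(
    {
        "sex": rng.choice(["Male", "Female"], n, p=[0.5, 0.5]),
        "income": rng.choice(["<=50K", ">50K"], n, p=[1 - p_inc, p_inc]),
    }
)

# share of each gender in the seed; both should be close to 0.5
sex_share = seed_check["sex"].value_counts(normalize=True)

# share of high-income earners within each gender; both should be close to p_inc
high_income_by_sex = (seed_check["income"] == ">50K").groupby(seed_check["sex"]).mean()

print(sex_share.round(3))
print(high_income_by_sex.round(3))
```

Because the two columns are drawn independently, any remaining difference between the two groups is sampling noise only.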

In [None]:
# probe the generator for synthetic data with a seed
syn = mostly.probe(generator=g, seed=seed)
print(f"Created synthetic data with {syn.shape[0]:,} records and {syn.shape[1]:,} attributes")

### Explore Synthetic Data

Show 10 randomly sampled synthetic records. Note that you can execute the following cell multiple times to see different samples.

You can see that the partially synthetic data consists of about half male and half female.

In [None]:
syn.sample(n=10)

Let's now compare the age distribution of records from the original data with that of the partially synthetic data. As we will see, the synthesized women are now significantly older, which is how the model meets the criterion of removing the gender income gap.

In a similar vein, you can now study other shifts in the distributions that are a consequence of the provided seed data.

In [None]:
import matplotlib.pyplot as plt

plt.xlim(10, 95)
plt.title("Female Age Distribution")
plt.xlabel("Age")
df[df.sex == "Female"].age.plot.kde(color="black", bw_method=0.2)
syn[syn.sex == "Female"].age.plot.kde(color="#24db96", bw_method=0.2)
plt.legend(["original", "synthetic"])  # labels follow the plotting order above
plt.show()
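
Alongside the density plot, a shift like this can also be summarized numerically. The sketch below uses a hypothetical helper, `compare_numeric`, which is not part of the SDK. It is run on toy stand-ins for `df` and `syn`. The toy ages are built so that the synthetic ones are exactly three years higher, so the printed numbers illustrate the mechanics only and are not a result from the tutorial data.

```python
import numpy as np
import pandas as pd


def compare_numeric(orig, synth, column, mask_col, mask_value):
    """Summarize one numeric column for a subgroup in both datasets."""
    stats = ["mean", "50%", "std"]
    o = orig.loc[orig[mask_col] == mask_value, column].describe()[stats]
    s = synth.loc[synth[mask_col] == mask_value, column].describe()[stats]
    out = pd.DataFrame({"original": o, "synthetic": s})
    out["shift"] = out["synthetic"] - out["original"]
    return out


# toy stand-ins for df and syn (assumption: not the tutorial data)
ages = np.tile(np.arange(20, 70), 20)  # 1,000 deterministic ages
sexes = ["Female"] * 500 + ["Male"] * 500
toy_orig = pd.DataFrame({"sex": sexes, "age": ages})
toy_syn = pd.DataFrame({"sex": sexes, "age": ages + 3})

result = compare_numeric(toy_orig, toy_syn, "age", "sex", "Female")
print(result.round(2))
```

With the real data, the call would presumably be `compare_numeric(df, syn, "age", "sex", "Female")`, and the same helper can be pointed at any other numeric column.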

## Use Case 2 - Partially Synthetic Geo Data

For this use case, we use 2019 AirBnB listings [[2](#refs)] for Manhattan. The dataset consists of 48,895 records and 10 mixed-type columns, two of which represent the latitude and longitude of the listing. We use this dataset to create synthetic attributes for all the actual locations contained in the original.

### Pre-Processing

We need to concatenate `latitude` and `longitude` into a single column, as this is the format MOSTLY AI expects in order to improve its representation of geo information.

In this example we do not artificially create Seed data. Instead, we use the concatenated `LAT_LONG` column and the `neighbourhood` column from the original data as the Seed dataframe.

In [None]:
# fetch original data
df_orig = pd.read_csv("https://github.com/mostly-ai/public-demo-data/raw/dev/airbnb/airbnb.csv.gz")
df_orig

In [None]:
df = df_orig.copy()

# concatenate latitude and longitude to "LAT, LONG" format
df["LAT_LONG"] = df["latitude"].astype(str) + ", " + df["longitude"].astype(str)
df = df.drop(columns=["latitude", "longitude"])

# define the list of columns that we want to condition on
seed_cols = ["neighbourhood", "LAT_LONG"]

# create dataframe that will be used as seed
df_seed = df[seed_cols]
display(df_seed.head())
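
Since the concatenation replaces the two numeric columns with a single string column, any later analysis that needs numeric coordinates has to parse `LAT_LONG` back. A minimal sketch of that reverse step follows. `split_lat_long` is a hypothetical helper, and the two coordinate pairs are illustrative values rather than rows taken from the dataset.

```python
import pandas as pd


def split_lat_long(frame, col="LAT_LONG"):
    """Return a copy with numeric latitude/longitude columns parsed from `col`."""
    parts = frame[col].str.split(",", expand=True)
    out = frame.copy()
    out["latitude"] = parts[0].str.strip().astype(float)
    out["longitude"] = parts[1].str.strip().astype(float)
    return out


# toy stand-in for the pre-processed dataframe (illustrative coordinates)
toy = pd.DataFrame({"latitude": [40.64749, 40.75362], "longitude": [-73.97237, -73.98377]})
toy["LAT_LONG"] = toy["latitude"].astype(str) + ", " + toy["longitude"].astype(str)

restored = split_lat_long(toy.drop(columns=["latitude", "longitude"]))
print(restored)
```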

### Create Generator with MOSTLY AI

The code below creates a Generator using the MOSTLY AI Synthetic Data SDK. We use the pre-processed AirBnB data and configure the column `LAT_LONG` with the encoding type `Latitude, Longitude`. To avoid waiting too long for the Generator to be ready, we limit the maximum training time to 2 minutes, which already provides sufficient quality.

In [None]:
# Train a generator on the pre-processed AirBnB data
config = {
    "name": "Conditional Generation Tutorial AirBnB",
    "tables": [
        {
            "name": "AirBnB",
            "data": df,
            "columns": [
                {"name": "neighbourhood_group", "model_encoding_type": "TABULAR_CATEGORICAL"},
                {"name": "neighbourhood", "model_encoding_type": "TABULAR_CATEGORICAL"},
                {"name": "room_type", "model_encoding_type": "TABULAR_CATEGORICAL"},
                {"name": "price", "model_encoding_type": "TABULAR_NUMERIC_AUTO"},
                {"name": "minimum_nights", "model_encoding_type": "TABULAR_NUMERIC_AUTO"},
                {"name": "number_of_reviews", "model_encoding_type": "TABULAR_NUMERIC_AUTO"},
                {"name": "last_review", "model_encoding_type": "TABULAR_DATETIME"},
                {"name": "reviews_per_month", "model_encoding_type": "TABULAR_NUMERIC_AUTO"},
                {"name": "availability_365", "model_encoding_type": "TABULAR_NUMERIC_AUTO"},
                {"name": "LAT_LONG", "model_encoding_type": "TABULAR_LAT_LONG"},
            ],
            "tabular_model_configuration": {
                "max_training_time": 2,
            },
        }
    ],
}

g_airbnb = mostly.train(config=config)

### Generate Synthetic Dataset with MOSTLY AI

We can now generate a Synthetic Dataset with the Seed that we created earlier.

In [None]:
# generate a synthetic dataset with a seed
syn_partial = mostly.probe(generator=g_airbnb, seed=df_seed)
print(f"Created synthetic data with {syn_partial.shape[0]:,} records and {syn_partial.shape[1]:,} attributes")
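
The plotting step further below pairs the synthetic prices with the original coordinates by row position, so it is worth checking that the seed columns come back unchanged and in the same order. The sketch below shows one way to do that with a hypothetical helper, `seed_preserved`, run on toy stand-ins for `df_seed` and `syn_partial`. That the SDK returns rows in seed order is an assumption this check would confirm, not something shown here.

```python
import pandas as pd


def seed_preserved(seed, synthetic):
    """Check that every seed column appears unchanged, row by row, in the output."""
    left = seed.reset_index(drop=True)
    right = synthetic[list(seed.columns)].reset_index(drop=True)
    # compare as strings so that differing dtypes do not cause false mismatches
    return len(left) == len(right) and left.astype(str).equals(right.astype(str))


# toy stand-ins for df_seed and syn_partial (illustrative values)
toy_seed = pd.DataFrame(
    {"neighbourhood": ["Harlem", "Midtown"], "LAT_LONG": ["40.8, -73.9", "40.7, -74.0"]}
)
toy_syn = toy_seed.assign(price=[120, 250])

print(seed_preserved(toy_seed, toy_syn))  # seed columns intact, same order
print(seed_preserved(toy_seed, toy_syn.iloc[::-1]))  # same rows, reversed order
```

With the real objects, the call would presumably be `seed_preserved(df_seed, syn_partial)`.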

### Explore Synthetic Data

Let's compare the price distribution across Manhattan. Note again that while the locations in the partially synthetic data are actual locations, all other attributes, including the price per night, are randomly sampled by the generative model. Still, these prices are statistically representative given the context, i.e. the location within Manhattan.

In [None]:
%%capture --no-display


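def plot_manhattan(df, title):
    # coordinates always come from df_orig, since the synthetic data only has LAT_LONG;
    # this relies on df having the same row order as df_orig
    ax = df_orig.plot.scatter(
        x="longitude",
        y="latitude",
        s=0.1,
        alpha=1,
        # color each point by log price, clipped to limit the effect of outliers
        color=np.log(df.price.clip(lower=50, upper=2_000)),
        cmap=plt.colormaps["YlOrRd"],
    )
    ax.set_aspect(1.3)
    ax.set_title(title)
    plt.show()


plot_manhattan(df_orig, "Original Data")
plot_manhattan(syn_partial, "Partially Synthetic Data")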

Note that you can also create fully synthetic data, which likewise yields statistically representative locations with their attributes. However, as these locations do not necessarily exist (e.g. they might end up in the Hudson River), the demonstrated approach allows you to combine the best of both worlds.

## Conclusion

In this tutorial we walked through the process of conditional generation to yield partially synthetic data. This allows you to probe the generative model with a specific context, whether hypothetical (use case 1) or real (use case 2), and gain corresponding insights for specific scenarios.

## Further exercises

In addition to walking through the above instructions, we suggest:
* using a different set of fixed columns for the US Census dataset
* generating a very large number of records for a fixed value set, e.g. 1 million records of 48-year-old female professors
* generating a fully synthetic version of the AirBnB Manhattan dataset
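
As a starting point for the second exercise, a seed in which every row carries the same fixed values could be built as sketched below. The column names and the category label `Prof-specialty` are assumptions about the census data and should be checked against `df`. The row count is kept small here and would be raised for the actual exercise.

```python
import pandas as pd

n = 1_000  # increase toward 1_000_000 for the actual exercise

# assumption: these column names and labels exist in the census data; verify against df
fixed = {"age": 48, "sex": "Female", "occupation": "Prof-specialty"}

# repeat the same fixed values in every row of the seed
constant_seed = pd.DataFrame({col: [value] * n for col, value in fixed.items()})

print(constant_seed.shape)
print(constant_seed.drop_duplicates())
```

Such a dataframe could then be passed as `seed` to `mostly.probe` in the same way as in the use cases above.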

## References<a class="anchor" name="refs"></a>

1. https://archive.ics.uci.edu/ml/datasets/adult
1. http://insideairbnb.com/get-the-data
