# Synthetic Text Generation using a fast LSTM model trained from scratch

In this notebook, we demonstrate how to synthesize free-text columns and then explore the quality of the generated text.

Note that a GPU with 24GB of memory or more is strongly recommended for running this tutorial.

For further background, see also [this blog post](https://mostly.ai/blog/synthetic-data-for-text-annotation/) on "How To Scale Up Your Text Annotation Initiatives with Synthetic Text". We will be using a trimmed-down version of a dataset containing AirBnB listings in London. This dataset can be downloaded from our public data repository [here](https://github.com/mostly-ai/public-demo-data/raw/dev/airbnb/london.csv.gz). The original data was downloaded from [Inside AirBnB](http://insideairbnb.com/get-the-data).

## Synthesize Data via MOSTLY AI

In [1]:
%pip install -U mostlyai # or: pip install 'mostlyai[local-gpu]'

In [None]:
import pandas as pd

# fetch original data
tgt = pd.read_csv("https://github.com/mostly-ai/public-demo-data/raw/dev/airbnb/london.csv.gz")
tgt

In [None]:
from mostlyai.sdk import MostlyAI

# initialize SDK
mostly = MostlyAI()

In [None]:
# Train a generator on the pre-processed AirBnB data
config = {
    "name": "Synthetic Text Tutorial AirBnB",
    "tables": [
        {
            "name": "airBnB",
            "data": tgt,
            "tabular_model_configuration": {
                "max_training_time": 1,  # the tabular model should finish in less than 1 min anyway
            },
            "language_model_configuration": {
                "max_training_time": 20,  # we recommend at least ~20 minutes if training on an A10G or similar GPU
            },
            "columns": [
                {"name": "host_name", "model_encoding_type": "LANGUAGE_TEXT"},
                {"name": "title", "model_encoding_type": "LANGUAGE_TEXT"},
                {"name": "property_type", "model_encoding_type": "TABULAR_CATEGORICAL"},
                {"name": "room_type", "model_encoding_type": "TABULAR_CATEGORICAL"},
                {"name": "neighbourhood", "model_encoding_type": "TABULAR_CATEGORICAL"},
                {"name": "price", "model_encoding_type": "TABULAR_NUMERIC_AUTO"},
            ],
        }
    ],
}

g_airbnb = mostly.train(config=config)

In [None]:
# generate a synthetic dataset
syn = mostly.generate(generator=g_airbnb, size=1_000).data()
print(f"Created synthetic data with {syn.shape[0]:,} records and {syn.shape[1]:,} attributes")

## Explore Synthetic Text

Show 10 randomly sampled synthetic records. Note that you can execute the following cell multiple times to see different samples.

In [None]:
syn.sample(n=10)

Compare this to 10 randomly sampled original records.

In [None]:
tgt.sample(n=10)

### Inspect Character Set

You will notice that the character set of the synthetic data is smaller. This is due to the privacy mechanism within the MOSTLY AI platform, which removes very rare tokens so that their presence cannot give away information on the existence of individual records.

In [None]:
# Concatenate 'title' strings and remove duplicates by converting to a set, then back to a list
tgt_chars = "".join(sorted(list(set(tgt["title"].str.cat(sep=" ")))))
syn_chars = "".join(sorted(list(set(syn["title"].str.cat(sep=" ")))))

# Display the concatenated strings and their lengths
print("## ORIGINAL ##\n", tgt_chars, "\n")
print("Length of ORIGINAL characters:", len(tgt_chars), "\n")
print("## SYNTHETIC ##\n", syn_chars, "\n")
print("Length of SYNTHETIC characters:", len(syn_chars), "\n")
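To see which characters were dropped, rather than only how many, a set difference is enough. The following is a minimal sketch on toy data; `chars_only_in` and the toy series are hypothetical stand-ins for `tgt["title"]` and `syn["title"]`, not part of the MOSTLY AI SDK.

```python
import pandas as pd


def chars_only_in(left: pd.Series, right: pd.Series) -> list[str]:
    """Return the sorted characters that occur in `left` but never in `right`."""
    left_chars = set(left.dropna().str.cat(sep=""))
    right_chars = set(right.dropna().str.cat(sep=""))
    return sorted(left_chars - right_chars)


# Toy stand-ins for tgt["title"] and syn["title"] (assumed data, for illustration only)
toy_tgt = pd.Series(["Cosy flat ★ Soho", "Großes Zimmer", None])
toy_syn = pd.Series(["Cosy flat Soho", "Grosses Zimmer"])

dropped = chars_only_in(toy_tgt, toy_syn)
print("Characters present only in the original:", dropped)
```

In the notebook itself, the same helper could be called as `chars_only_in(tgt["title"], syn["title"])`.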

### Inspect Character Frequency

In [None]:
# Get character frequencies for 'tgt'
tgt_chars = tgt["title"].str.split("").explode()
tgt_freq = tgt_chars.value_counts(normalize=True).rename_axis("char").reset_index(name="tgt")

# Get character frequencies for 'syn'
syn_chars = syn["title"].str.split("").explode()
syn_freq = syn_chars.value_counts(normalize=True).rename_axis("char").reset_index(name="syn")

# Merge the frequencies and sort
title_char_freq = pd.merge(tgt_freq, syn_freq, on="char", how="outer").round(5)
title_char_freq.sort_values(by="tgt", ascending=False, inplace=True)

# Display the frequencies
title_char_freq.head(10)

In [None]:
import matplotlib.pyplot as plt

# Set 'char' column as the index
title_char_freq_indexed = title_char_freq.set_index("char")

# Plot the first 50 characters using the new index
ax = title_char_freq_indexed.head(50).plot.line()
plt.title("Distribution of 50 most common characters")

# Set x-axis labels with no rotation for better readability
plt.xticks(
 ticks=range(len(title_char_freq_indexed.head(50))), labels=title_char_freq_indexed.head(50).index, rotation=0
)

plt.xlabel("Character")
plt.ylabel("Frequency")

plt.show()

We can see that character frequencies are closely retained.
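A visual comparison can be complemented with a single number. One common choice (an assumption here, not something the platform reports in this notebook) is the total variation distance, i.e. half the summed absolute difference between the two frequency tables; 0 means identical distributions and 1 means disjoint ones. The helper name and the toy frequencies below are hypothetical.

```python
import pandas as pd


def total_variation_distance(p: pd.Series, q: pd.Series) -> float:
    """Half the L1 distance between two frequency tables indexed by category."""
    # Align on the union of categories; a category missing on one side has frequency 0
    p, q = p.align(q, fill_value=0.0)
    return 0.5 * float((p - q).abs().sum())


# Toy frequency tables (assumed data); in the notebook these could be built from
# title_char_freq.set_index("char")["tgt"] and ["syn"], with NaN filled as 0
toy_tgt_freq = pd.Series({"e": 0.5, "o": 0.3, "★": 0.2})
toy_syn_freq = pd.Series({"e": 0.6, "o": 0.4})

tvd = total_variation_distance(toy_tgt_freq, toy_syn_freq)
print(f"Total variation distance: {tvd:.3f}")
```

The same helper applies unchanged to the term frequencies, since it only relies on the index of the two series.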

### Inspect Term Frequency

In [None]:
import pandas as pd
import re


def sanitize(s):
    s = str(s).lower()
    # replace common punctuation with spaces, then collapse repeated spaces
    s = re.sub(r'[,.)(!":/]', " ", s)
    s = re.sub("[ ]+", " ", s)
    return s


# Apply the sanitize function and split the titles into terms
tgt["terms"] = tgt["title"].apply(lambda x: sanitize(x)).str.split(" ")
syn["terms"] = syn["title"].apply(lambda x: sanitize(x)).str.split(" ")

# Explode 'terms' and create a DataFrame with explicit column names before applying value_counts
tgt_terms_df = tgt["terms"].explode().to_frame(name="term")
syn_terms_df = syn["terms"].explode().to_frame(name="term")

# Calculate the normalized value counts and reset the index
tgt_freq = tgt_terms_df["term"].value_counts(normalize=True).reset_index(name="tgt").rename(columns={"index": "term"})
syn_freq = syn_terms_df["term"].value_counts(normalize=True).reset_index(name="syn").rename(columns={"index": "term"})

# Merge the frequencies and sort by 'tgt' in descending order
title_term_freq = pd.merge(tgt_freq, syn_freq, on="term", how="outer").round(5)
title_term_freq = title_term_freq.sort_values(by="tgt", ascending=False)

# Display the top and bottom rows
display(title_term_freq.head(10))
display(title_term_freq.head(200).tail(10))

In [None]:
# Set 'term' column as the index
title_term_freq_indexed = title_term_freq.set_index("term")

# Plot the first 50 terms using the new index
ax = title_term_freq_indexed.head(50).plot.line()
plt.title("Distribution of 50 most common terms")

# Set x-axis labels with a 90-degree rotation for better readability
plt.xticks(
 ticks=range(len(title_term_freq_indexed.head(50))), labels=title_term_freq_indexed.head(50).index, rotation=90
)

plt.xlabel("Term")
plt.ylabel("Frequency")

plt.show()

We can see that term frequencies are closely retained.

### Inspect Term Co-occurrence

In [None]:
def calc_conditional_probability(term1, term2):
    # Lower-case the titles and replace NaN values before applying str.contains
    tgt_titles = tgt["title"].fillna("").str.lower()
    syn_titles = syn["title"].fillna("").str.lower()

    # Boolean masks for the titles that contain term1
    tgt_has_term1 = tgt_titles.str.contains(term1)
    syn_has_term1 = syn_titles.str.contains(term1)

    # Among the titles containing term1, compute the share that also contains term2
    tgt_share = tgt_titles[tgt_has_term1].str.contains(term2).mean()
    syn_share = syn_titles[syn_has_term1].str.contains(term2).mean()

    print(f"{tgt_share:.0%} of original listings that contain `{term1}` also contain `{term2}`")
    print(f"{syn_share:.0%} of synthetic listings that contain `{term1}` also contain `{term2}`")
    print("")


calc_conditional_probability("bed", "double")
calc_conditional_probability("bed", "king")
calc_conditional_probability("heart", "london")
calc_conditional_probability("london", "heart")

We can see that Term Co-occurrences are almost perfectly retained.
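Note that `str.contains` interprets its argument as a regular expression by default, so terms containing characters such as `+` or `(` would need escaping. The following sketch shows a variant that matches terms literally and returns the share instead of printing it, which makes it easier to tabulate many term pairs. `conditional_share` and the toy titles are hypothetical and only illustrate the idea.

```python
import pandas as pd


def conditional_share(titles: pd.Series, term1: str, term2: str) -> float:
    """Share of titles containing `term1` that also contain `term2` (literal match)."""
    lowered = titles.fillna("").str.lower()
    has_term1 = lowered.str.contains(term1, regex=False)
    if not has_term1.any():
        # No title contains term1, so the conditional share is undefined
        return float("nan")
    return float(lowered[has_term1].str.contains(term2, regex=False).mean())


# Toy stand-in for tgt["title"] or syn["title"] (assumed data)
toy_titles = pd.Series(
    ["Double bed in Soho", "King bed, quiet", "Bright studio", None, "Double bedroom"]
)

share = conditional_share(toy_titles, "bed", "double")
print(f"{share:.0%} of toy titles that contain `bed` also contain `double`")
```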

Now you might be asking yourself: if all of these characteristics are maintained, what are the chances that we'll end up with exact matches, i.e. synthetic records with the exact same `title` value as a record in the original dataset? Or even a synthetic record with the exact same values for all the columns?

Let's start by trying to find an exact match for 1 specific synthetic `title` value:

In [None]:
# Find exact matches for one specific synthetic title value. Copy a `title` value from a synthetic
# record into `title_value` below and run the cell to look it up in the original dataset.
title_value = "Airy large double room"
# Case-insensitive equality, so that only full-title matches are returned (not substrings)
tgt.loc[tgt["title"].str.lower() == title_value.lower()]

Depending on your chosen value, you may or may not find an exact match. This row-by-row validation process doesn't indicate very much and, more importantly, doesn't scale very well to the 71K rows in the dataset.

### Inspect Privacy via Exact Matches

Let's perform a simplified check for privacy, by looking for exact matches between the synthetic and the original.

For that we first split the original data into two equally-sized sets, and measure the number of matches between those two sets.

In [None]:
n = int(tgt.shape[0] / 2)
pd.merge(tgt[["title"]][:n].drop_duplicates(), tgt[["title"]][n:].drop_duplicates())

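Next, we take a subset of the synthetic data of at most the same size, and again measure the number of matches between that set and the original data. Note that with the 1,000 synthetic records generated above, `syn[:n]` simply contains all of them; generate a larger synthetic dataset if you want a like-for-like comparison.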

In [None]:
pd.merge(tgt[["title"]][:n].drop_duplicates(), syn[["title"]][:n].drop_duplicates())

We can see that exact matches between original and synthetic data can occur. However, they occur only for the most commonly used descriptions, and they do not occur more often than they occur in the original data itself.

Thus, it's important to note that matching values or matching complete records are not by themselves a sign of a privacy leak. They are only an issue if they occur more frequently than we would expect based on the original dataset. Also note that removing those exact matches via post-processing would be counterproductive. The absence of a value like "Lovely single room" in a sufficiently large synthetic text corpus would in this case actually give away the fact that this sentence was present in the original. See [[1](#refs)] and [[2](#refs)] for more background info on this aspect.
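The comparison described here, matches between synthetic and original versus matches within the original itself, can be expressed as two rates. The sketch below is one possible way to do this under stated assumptions: `match_rate` is a hypothetical helper, and the three toy series stand in for the two halves of the original data and a synthetic sample.

```python
import pandas as pd


def match_rate(candidate: pd.Series, reference: pd.Series) -> float:
    """Share of distinct candidate values that also occur in the reference."""
    distinct = candidate.dropna().drop_duplicates()
    if distinct.empty:
        return float("nan")
    return float(distinct.isin(reference.dropna().unique()).mean())


# Toy stand-ins (assumed data): two halves of the original and a synthetic sample
half_a = pd.Series(["Lovely single room", "Cosy flat", "Garden studio", "Loft by the river"])
half_b = pd.Series(["Lovely single room", "Cosy flat", "Canal view room", "Quiet attic"])
synthetic = pd.Series(["Lovely single room", "Sunny loft", "Bright attic room", "Cosy garden flat"])

baseline = match_rate(half_b, half_a)     # original vs. original
observed = match_rate(synthetic, half_a)  # synthetic vs. original
print(f"baseline: {baseline:.0%}, synthetic: {observed:.0%}")
```

A synthetic rate at or below the baseline is consistent with the argument above; a rate well above it would warrant a closer look. For a fair comparison, the candidate sets should be of similar size.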

### Analyze Price vs. Text correlation

In [None]:
tgt_term_price = tgt[["terms", "price"]].explode(column="terms").groupby("terms")["price"].median()
syn_term_price = syn[["terms", "price"]].explode(column="terms").groupby("terms")["price"].median()


def print_term_price(term):
    print(f"Median Price of original Listings, that contain `{term}`: ${tgt_term_price[term]:.0f}")
    print(f"Median Price of synthetic Listings, that contain `{term}`: ${syn_term_price[term]:.0f}")
    print("")


print_term_price("luxury")
print_term_price("stylish")
print_term_price("cozy")
print_term_price("small")

We can see that correlations between term occurrence and the price per night are also very well retained.
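Instead of spot-checking a handful of terms, the per-term median prices can be compared across all shared terms at once. The sketch below uses a Spearman rank correlation, computed as the Pearson correlation of the ranks; this metric is a suggestion rather than something the notebook prescribes, and `median_price_by_term` and the toy frames are hypothetical.

```python
import pandas as pd


def median_price_by_term(df: pd.DataFrame) -> pd.Series:
    """Median price per lower-cased, whitespace-separated title term."""
    terms = df["title"].fillna("").str.lower().str.split()
    exploded = df.assign(term=terms).explode("term").dropna(subset=["term"])
    return exploded.groupby("term")["price"].median()


# Toy stand-ins for the original and synthetic data (assumed values)
toy_tgt = pd.DataFrame(
    {"title": ["Luxury loft", "Small room", "Luxury flat", "Cozy room"], "price": [300, 40, 260, 60]}
)
toy_syn = pd.DataFrame(
    {"title": ["Luxury flat", "Small room", "Cozy loft", "Luxury room"], "price": [280, 45, 90, 250]}
)

# Keep only the terms that occur in both datasets
both = pd.concat(
    {"tgt": median_price_by_term(toy_tgt), "syn": median_price_by_term(toy_syn)}, axis=1
).dropna()

# Spearman rank correlation: Pearson correlation of the ranked values
rank_corr = both["tgt"].rank().corr(both["syn"].rank())
print(f"Rank correlation of per-term median prices: {rank_corr:.2f}")
```

On real data it would make sense to restrict the comparison to terms with a minimum number of occurrences, since medians over a handful of listings are noisy.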

## Conclusion

This tutorial demonstrated how synthetic text can be generated within the context of an otherwise structured dataset. We analyzed the generated texts, and validated that characters and terms occur with similar frequencies, while exact matches are not any more likely to occur than within the original text itself.

This feature thus makes it possible to retain valuable statistical insights that are typically buried in free text columns and would otherwise remain inaccessible due to their privacy-sensitive nature.

## Further exercises

In addition to walking through the above instructions, we suggest:
* analyzing further correlations, also for `host_name`
* using a different generation mood, e.g. conservative sampling
* using a different dataset, e.g. the Austrian first names dataset [[3](#refs)]

## References

1. https://www.frontiersin.org/articles/10.3389/fdata.2021.679939/full
1. https://mostly.ai/blog/truly-anonymous-synthetic-data-legal-definitions-part-ii/
1. https://github.com/mostly-ai/public-demo-data/blob/dev/firstnames_at/firstnames_at.csv.gz