# Synthetic Text Generation using Large Language Models from Hugging Face <a href="https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/synthetic-text-llm/synthetic-text-llm.ipynb" target="_blank"><img src="https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab" alt="Run on Colab"></a>

In this notebook, we demonstrate how to synthesize free text columns using pre-trained large language models with billions of parameters from [Hugging Face](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending).

A GPU with at least 24GB of memory is strongly recommended for running this tutorial.

MOSTLY AI leverages parameter-efficient fine-tuning [[1](#refs)], quantization [[2](#refs)], and activation checkpointing [[3](#refs)] to train large language models in a memory-efficient manner. As a result, even a single GPU with only 24GB of memory should be able to train LLMs with billions of parameters.
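
To give a rough sense of why these techniques matter, the following back-of-the-envelope sketch estimates the memory taken by model weights at different precisions, and the number of trainable parameters when a single weight matrix is adapted with LoRA. The 7B model size, the 4096 x 4096 matrix, and the rank of 16 are illustrative assumptions, not settings used by MOSTLY AI. The estimate covers weights only and ignores activations, gradients, and optimizer state.

```python
# Back-of-the-envelope arithmetic; all concrete numbers are illustrative assumptions.


def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory needed to hold the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA trains two low-rank factors (rank x d_in and d_out x rank) per adapted matrix."""
    return rank * (d_in + d_out)


n_params = 7e9  # hypothetical 7B-parameter model
print(f"16-bit weights: {weight_memory_gb(n_params, 16):.1f} GB")
print(f" 4-bit weights: {weight_memory_gb(n_params, 4):.1f} GB")

# hypothetical 4096 x 4096 projection matrix adapted with rank 16
full = 4096 * 4096
lora = lora_trainable_params(4096, 4096, rank=16)
print(f"trainable parameters: {lora:,} instead of {full:,} ({lora / full:.2%})")
```

Quantization shrinks the first number, LoRA shrinks the second, and activation checkpointing trades extra compute for lower activation memory, which this sketch does not model.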

## Synthesize Data via MOSTLY AI

In [None]:
%pip install -U mostlyai  # or: pip install 'mostlyai[local-gpu]'

In [None]:
import pandas as pd

# fetch original data
df = pd.read_parquet("https://github.com/mostly-ai/public-demo-data/raw/dev/headlines/headlines.parquet")

# split into train and holdout sets; the holdout set is used later in the tutorial
# to evaluate the performance of the generator
tgt = df.sample(frac=0.9, random_state=42)
hol = df.drop(tgt.index)
tgt[["headline", "category"]].head(5)

In [None]:
from mostlyai.sdk import MostlyAI

# initialize SDK
mostly = MostlyAI()

In [None]:
# train a generator on the dataset

# specify any Hugging Face language model that fits on a single GPU such as Mistral, Llama, Qwen, and so forth
huggingface_model = "Qwen/Qwen2.5-1.5B"

# NOTE: some models are gated and require a HF_TOKEN environment variable to be set
# huggingface_model = "meta-llama/Llama-3.2-1B"
# import os
# os.environ["HF_TOKEN"] = "hf_..."  # only needed if you want to use Llama or other gated models

config = {
    "name": "Synthetic Text Tutorial",
    "tables": [
        {
            "name": "headlines",
            "data": tgt,
            "tabular_model_configuration": {
                "max_training_time": 1,  # the tabular model should finish in less than 1 min anyway
            },
            "language_model_configuration": {
                "max_training_time": 20,  # we recommend at least ~20 minutes if training on an A10G or similar GPU
                "model": huggingface_model,
            },
            "columns": [
                {"name": "category", "model_encoding_type": "TABULAR_CATEGORICAL"},
                {"name": "headline", "model_encoding_type": "LANGUAGE_TEXT"},
            ],
        }
    ],
}

# train a generator
g_headlines = mostly.train(config=config)

# note: on the first run, the LLM is downloaded from Hugging Face, which may take some time depending on your connection

In [None]:
# generate a synthetic dataset
syn = mostly.generate(generator=g_headlines, size=5000).data()
print(f"Created synthetic data with {syn.shape[0]:,} records and {syn.shape[1]:,} attributes")

## Explore Synthetic Text

Show 10 randomly sampled synthetic records. Note that you can execute the following cell multiple times to see different samples.

In [None]:
syn.sample(n=10)

Compare this to 10 randomly sampled original records.

In [None]:
tgt.sample(n=10)

### How does the synthetic data compare to the original data when taking categories into account?

Next, we perform a sanity check to see whether the synthetic data resembles the original data when taking categories into account. We encode each headline using a sentence transformer, apply PCA dimensionality reduction to the resulting embeddings, and visualize the first two principal components. If the synthetic data is similar to the original data, the categories should be distributed similarly in both plots.
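
Before plotting, a quick numeric check of the category mix can complement the visual comparison. The sketch below computes the total variation distance between the category shares of two label lists. It is not part of the MOSTLY AI SDK, and the toy labels merely stand in for `tgt["category"]` and `syn["category"]`. Both inputs are assumed to be non-empty.

```python
from collections import Counter


def category_shares(labels):
    """Return the relative frequency of each category label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(labels_a, labels_b):
    """Half the sum of absolute share differences: 0 = identical mix, 1 = disjoint."""
    shares_a = category_shares(labels_a)
    shares_b = category_shares(labels_b)
    categories = set(shares_a) | set(shares_b)
    return 0.5 * sum(abs(shares_a.get(c, 0.0) - shares_b.get(c, 0.0)) for c in categories)


# toy labels (hypothetical values) standing in for the original and synthetic columns
original_labels = ["POLITICS"] * 5 + ["SPORTS"] * 3 + ["TRAVEL"] * 2
synthetic_labels = ["POLITICS"] * 4 + ["SPORTS"] * 4 + ["TRAVEL"] * 2
tvd = total_variation_distance(original_labels, synthetic_labels)
print(f"total variation distance: {tvd:.2f}")
```

With the notebook's data, the same function could be called as `total_variation_distance(tgt["category"].tolist(), syn["category"].tolist())`. A value close to 0 indicates that the category shares match closely.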

In [None]:
# perform PCA dimensionality reduction and visualization
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np

# load the sentence transformer used to embed the headlines of both datasets
model = SentenceTransformer("all-MiniLM-L6-v2")

# get the 5 most frequent categories
value_counts = syn["category"].value_counts()
selected_categories = value_counts.head(5).index

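# filter datasets to the most frequent categories and sample up to 1,000 records each
tgt_head = tgt[tgt["category"].isin(selected_categories)]
tgt_head = tgt_head.sample(n=min(1000, len(tgt_head)), random_state=42)
syn_head = syn[syn["category"].isin(selected_categories)]
syn_head = syn_head.sample(n=min(1000, len(syn_head)), random_state=42)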

tgt_features = model.encode(tgt_head["headline"].tolist())
syn_features = model.encode(syn_head["headline"].tolist())

# apply PCA to both datasets
pca = PCA(n_components=2, random_state=42)
combined = np.vstack((syn_features, tgt_features))
pca.fit(combined)  # important: fit PCA on the combined data
syn_pca = pca.transform(syn_features)
tgt_pca = pca.transform(tgt_features)

# plot the PCA results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))
colors = plt.cm.rainbow(np.linspace(0, 1, len(selected_categories)))

# plot synthetic data
for category, color in zip(selected_categories, colors):
    mask = syn_head["category"] == category
    ax1.scatter(syn_pca[mask, 0], syn_pca[mask, 1], c=[color], label=category, alpha=0.6)
ax1.set_title("Synthetic Headlines")
ax1.legend()

# plot original data
for category, color in zip(selected_categories, colors):
    mask = tgt_head["category"] == category
    ax2.scatter(tgt_pca[mask, 0], tgt_pca[mask, 1], c=[color], label=category, alpha=0.6)
ax2.set_title("Original Headlines")
ax2.legend()

plt.suptitle(f"PCA visualization of headlines by category\n ({huggingface_model})")
plt.show()

### Train a classifier on the synthetic data and evaluate its performance on the real (original) data

At the beginning of this tutorial, we set aside a holdout set of the original data, `hol`, which we did not use to train the generator. We now use this holdout set to evaluate a classifier trained on the synthetic data, and compare it to a classifier trained on the original data (excluding the holdout set) to see how much information is retained in the synthetic data.

In [None]:
# train simple classifier on the synthetic data, and evaluate its performance on the real data holdout
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# embed the synthetically generated data
syn_train_features = model.encode(syn_head["headline"].tolist())
syn_train_labels = syn_head["category"].tolist()

# filter holdout data to only include categories present in synthetic data
mask = hol["category"].isin(syn_train_labels)
hol_features = model.encode(hol[mask]["headline"].tolist())
hol_labels = hol[mask]["category"].tolist()

# train classifier on synthetic data
clf = LogisticRegression(max_iter=1000, random_state=42)
clf.fit(syn_train_features, syn_train_labels)

# evaluate on holdout set
hol_pred = clf.predict(hol_features)
print("Trained on SYNTHETIC data, evaluated on an actual holdout:")
print(classification_report(hol_labels, hol_pred))

In [None]:
# for comparison, we train a simple classifier on the real data and evaluate its performance on the real data holdout
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# embed the real training data
tgt_features = model.encode(tgt_head["headline"].tolist())
tgt_labels = tgt_head["category"].tolist()

# train classifier on real data
clf = LogisticRegression(max_iter=1000, random_state=42)
clf.fit(tgt_features, tgt_labels)

# evaluate on holdout set
hol_pred = clf.predict(hol_features)
print("Trained on ORIGINAL data, evaluated on an actual holdout:")
print(classification_report(hol_labels, hol_pred))

The synthetic data retains much of the structure of the original data, so the classifier trained on the synthetic data should perform similarly to the classifier trained on the original data.
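
To put a single number on "perform similarly", one option is to compare the holdout accuracy of the two classifiers. The following is a minimal sketch with hypothetical helper names and toy predictions. Note that the cells above reuse the name `hol_pred` for both classifiers, so the first set of predictions would need to be stored under a separate name before running the second cell.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy_gap(y_true, pred_from_synthetic, pred_from_original):
    """Accuracy lost by training on synthetic instead of original data (positive = loss)."""
    return accuracy(y_true, pred_from_original) - accuracy(y_true, pred_from_synthetic)


# toy labels and predictions standing in for the holdout labels and the two prediction arrays
y_true = ["A", "B", "A", "C", "B"]
pred_syn = ["A", "B", "C", "C", "A"]   # 3 of 5 correct
pred_orig = ["A", "B", "A", "C", "A"]  # 4 of 5 correct
gap = accuracy_gap(y_true, pred_syn, pred_orig)
print(f"accuracy gap: {gap:.2f}")
```

A gap near zero suggests that little predictive signal was lost in the synthetic data; the per-class figures in `classification_report` remain the more detailed view.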

## Conclusion
In this tutorial, we demonstrated how to train a large language model to generate synthetic text conditioned on synthetic tabular data using MOSTLY AI's SDK. We analyzed the generated texts and showed that the synthetic data is similar in structure to the original data. We then trained a classifier on the synthetic data and evaluated its performance on the original data.

This feature lets users draw on the world knowledge encoded in large language models to generate synthetic text data, while retaining the structure of the original data.

## Further exercises

In addition to walking through the above instructions, we suggest:
* trying seeded generation to condition your LLM on different data
* using a different dataset
* trying multi-billion-parameter models
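
As a starting point for the seeded generation exercise, here is a minimal sketch. It assumes that `mostly.generate` accepts a `seed` DataFrame whose columns are held fixed while the remaining columns are generated; please verify this against the SDK documentation for your installed version. The helper names and the category values are hypothetical.

```python
def build_seed_rows(categories, n_per_category):
    """Create one seed row per requested record, fixing only the `category` column."""
    return [{"category": category} for category in categories for _ in range(n_per_category)]


def generate_seeded(mostly, generator, seed_rows):
    """Generate records conditioned on the given seed rows.

    Assumption: `mostly.generate` accepts a `seed` DataFrame (check the SDK docs).
    """
    import pandas as pd  # imported here so that build_seed_rows stays dependency-free

    seed = pd.DataFrame(seed_rows)
    return mostly.generate(generator=generator, seed=seed).data()


# hypothetical category values; use values that occur in your training data
seed_rows = build_seed_rows(["POLITICS", "SPORTS"], n_per_category=3)
print(len(seed_rows), seed_rows[0])

# with the client and trained generator from this tutorial:
# syn_seeded = generate_seeded(mostly, g_headlines, seed_rows)
```

Each seed row should then yield one synthetic headline for the given category, which makes it easy to request more text for categories of particular interest.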

## References<a class="anchor" name="refs"></a>

1. LoRA: Low-Rank Adaptation of Large Language Models, https://arxiv.org/abs/2106.09685
2. QLoRA: Efficient Finetuning of Quantized LLMs, https://arxiv.org/abs/2305.14314
3. Training Deep Nets with Sublinear Memory Cost, https://arxiv.org/abs/1604.06174