# Getting Started with the SDK  <a href="https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/quick-start/quick-start.ipynb" target="_blank"><img src="https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab" alt="Run on Colab"></a>

In this notebook, we take our first steps with the SDK: we train a basic single-table generator and then probe it for new synthetic samples.

In [None]:
%pip install -U mostlyai  # or: pip install -U 'mostlyai[local]'

## Load Original Data

Fetch the original data that will be used to train the generator, here the US Census Income dataset.

In [None]:
import pandas as pd

# fetch some original data
repo_url = "https://github.com/mostly-ai/public-demo-data/raw/refs/heads/dev"
df_original = pd.read_csv(f"{repo_url}/census/census.csv.gz")
df_original
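
Before training, it can help to check column types, missing values, and cardinalities. The following is a minimal sketch rather than part of the SDK: `summarize_columns` is a hypothetical helper, and the tiny stand-in DataFrame (with made-up values) is used only so that the example runs offline.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, missing count, and distinct count."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_missing": df.isna().sum(),
            "n_unique": df.nunique(),  # distinct non-missing values
        }
    )


# tiny stand-in for df_original (values are made up for illustration)
df_toy = pd.DataFrame(
    {
        "age": [24, 37, 52, 37],
        "native_country": ["Mexico", "United-States", None, "Mexico"],
    }
)
summary = summarize_columns(df_toy)
print(summary)
```

With the real data, `summarize_columns(df_original)` would give the same overview for every census column.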

## Initialize the SDK

Create the SDK client. All later calls in this notebook (`train`, `probe`, `generate`) go through this object.



In [None]:
from mostlyai.sdk import MostlyAI

# initialize SDK
mostly = MostlyAI()

## Train a Generator

Train a synthetic data generator on the census table. Training time is capped at 1 minute here to keep the example fast.

In [None]:
g = mostly.train(
    config={
        "name": "US Census Income",  # name of the generator
        "tables": [
            {
                "name": "census",
                "data": df_original,
                "tabular_model_configuration": {  # tabular model configuration (optional)
                    "max_training_time": 1,  # - limit training time (in minutes)
                    # model, max_epochs, ..        # further model configurations (optional)
                    # 'differential_privacy': {    # differential privacy configuration (optional)
                    #     'max_epsilon': 5.0,      # - max epsilon value, used as stopping criterion
                    #     'delta': 1e-5,           # - delta value
                    # }
                },
                # columns, keys, compute,..      # further table configurations (optional)
            }
        ],
    },
    start=True,  # start training immediately (default: True)
    wait=True,  # wait for completion (default: True)
)

# display the quality assurance report
g.reports(display=True)

## Generate Synthetic Data

Probe the trained generator for 100 representative synthetic samples.

In [None]:
df_samples = mostly.probe(g, size=100)
df_samples

Generate a larger representative synthetic dataset of 100,000 records.

In [None]:
sd = mostly.generate(g, size=100_000)
df_synthetic = sd.data()
df_synthetic
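
Beyond the quality assurance report, a quick manual check is to compare the relative frequencies of a categorical column in the original and the synthetic data. The sketch below is an illustration under stated assumptions: `compare_shares` is a hypothetical helper, not an SDK function, and the two toy DataFrames stand in for `df_original` and `df_synthetic`.

```python
import pandas as pd


def compare_shares(original: pd.DataFrame, synthetic: pd.DataFrame, column: str) -> pd.DataFrame:
    """Side-by-side relative frequencies of `column`, sorted by absolute difference."""
    shares = pd.concat(
        [
            original[column].value_counts(normalize=True).rename("original"),
            synthetic[column].value_counts(normalize=True).rename("synthetic"),
        ],
        axis=1,
    ).fillna(0.0)  # a category missing on one side has a share of 0
    shares["abs_diff"] = (shares["original"] - shares["synthetic"]).abs()
    return shares.sort_values("abs_diff", ascending=False)


# toy stand-ins (values are made up for illustration)
toy_original = pd.DataFrame({"native_country": ["Mexico", "Mexico", "Canada", "Cuba"]})
toy_synthetic = pd.DataFrame({"native_country": ["Mexico", "Canada", "Canada", "Canada"]})
shares = compare_shares(toy_original, toy_synthetic, "native_country")
print(shares)
```

With the real data, the call would be `compare_shares(df_original, df_synthetic, "native_country")`. Small differences are expected, since the synthetic data is sampled rather than copied.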

Conditionally generate 1,000 records of 24-year-olds whose native country is Mexico.

In [None]:
df_seed = pd.DataFrame(
    {
        "age": [24] * 1_000,
        "native_country": ["Mexico"] * 1_000,
    }
)
# conditionally probe, based on provided seed
df_samples = mostly.probe(g, seed=df_seed)
df_samples
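
To confirm that a seeded result honors the conditions, one option is to compare the seeded columns against the seed. This is a minimal sketch that assumes the seeded columns are carried through unchanged and in the same row order. `seed_is_preserved` is a hypothetical helper, and the toy `income` column and its values are made up for illustration.

```python
import pandas as pd


def seed_is_preserved(samples: pd.DataFrame, seed: pd.DataFrame) -> bool:
    """True if every seed column appears in `samples` with identical values, row by row."""
    if len(samples) != len(seed):
        return False
    for column in seed.columns:
        if column not in samples.columns:
            return False
        # compare by position, ignoring any index differences
        left = samples[column].reset_index(drop=True)
        right = seed[column].reset_index(drop=True)
        if not left.eq(right).all():
            return False
    return True


# toy stand-ins for df_seed and df_samples
toy_seed = pd.DataFrame({"age": [24] * 3, "native_country": ["Mexico"] * 3})
toy_samples = toy_seed.assign(income=["<=50K", "<=50K", ">50K"])
ok = seed_is_preserved(toy_samples, toy_seed)

# a sample that deviates from the seed in one row fails the check
toy_broken = toy_samples.copy()
toy_broken.loc[1, "age"] = 30
broken = seed_is_preserved(toy_broken, toy_seed)
print(ok, broken)
```

With the real data, the call would be `seed_is_preserved(df_samples, df_seed)`.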

## Conclusion

This tutorial demonstrated the basic usage of the Synthetic Data SDK. You trained a generator from scratch on the original data, and then used it to sample new records according to your specifications.

See the other tutorials for further exercises.