# MOSTLY AI vs. SDV Comparison - Single Table Scenario <a href="https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/sdv-comparison/single-table-scenario/single-table-scenario.ipynb" target="_blank"><img src="https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab" alt="Run on Colab"></a>

## Framework Comparison
This notebook compares two synthetic data generation libraries on a large-scale dataset:

- **SDV (Synthetic Data Vault)** - Business Source License
- **MOSTLY AI SDK** - Apache 2.0 License - Open Source

## Dataset & Objective
We'll use the **US Census Income dataset (ACS 2018, about 1.4M records)** to:
- Compare training performance and generation speed
- Evaluate synthetic data quality using comprehensive metrics
- Assess privacy preservation capabilities
- Provide practical guidance for framework selection

## Key Takeaways
- Performance benchmarks on large-scale data
- Quality comparison metrics
- Privacy assessment results

In [None]:
# Install the SDK in CLIENT mode
!uv pip install -U mostlyai
# Or install in LOCAL mode (needed here, since this notebook trains with local=True)
!uv pip install -U 'mostlyai[local]'
# Note: Restart the kernel session after installation!

!uv pip install -q scikit-learn seaborn lightgbm sdv

# 1. Data Preparation

## Loading the Dataset
We'll use the US Census Income dataset (ACS 2018, about 1.4M records), which contains demographic, employment, and financial information and is large enough to test synthetic data generation at scale.

In [None]:
import pandas as pd

# Load the ACS Income dataset (1.4M records) from remote Parquet file
# Note: This is a large dataset - initial load may take a while
data = pd.read_parquet(
    "https://github.com/mostly-ai/public-demo-data/raw/refs/heads/dev/census/acs-income-2018.parquet"
).iloc[:, :15]
# drop unused categorical labels, so that SDV does not crash
for col in data.select_dtypes(["category"]).columns:
    data[col] = data[col].cat.remove_unused_categories()

# Display basic info about the dataset
print(f"Dataset shape: {data.shape}")
print(f"Memory usage: {data.memory_usage(deep=True).sum() / 1024**2:.1f} MB")

data.head()

## Dataset Overview

The dataset contains 15 columns with mixed data types. This combination of numerical and categorical data makes it ideal for testing both frameworks' capabilities.

In [None]:
# Display column names and basic data types
print("\nColumns:")
print(data.dtypes)

## Train/Holdout Split

We split the data into:
- **Training Set (80% - 1.2M records)**: For model training
- **Holdout Set (20% - 0.3M records)**: For quality evaluation

This split ensures we can properly assess synthetic data quality against unseen real data.
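
As a rough illustration of what a seeded shuffle split does, the sketch below splits row positions 80/20 and checks that the two sets never overlap. The helper `shuffle_split` is hypothetical and written only for this example; it is not how scikit-learn implements `train_test_split`.

```python
import random


def shuffle_split(n_rows, holdout_fraction=0.2, seed=1):
    """Return disjoint (train, holdout) lists of row positions (illustrative helper)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed, so the split is repeatable
    n_holdout = round(n_rows * holdout_fraction)
    return indices[n_holdout:], indices[:n_holdout]


train_idx, holdout_idx = shuffle_split(n_rows=1_000)
print(len(train_idx), len(holdout_idx))        # 800 200
print(set(train_idx).isdisjoint(holdout_idx))  # True
```

Because no holdout row is ever seen during training, any closeness between synthetic and holdout records reflects general patterns rather than memorized rows.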

In [None]:
from sklearn.model_selection import train_test_split

# Split data into training/holdout sets
# Using stratified split would be better for classification tasks, but not critical here
# random_state=1 ensures reproducible results
train, holdout = train_test_split(
    data,
    test_size=0.2,  # 20% for holdout evaluation
    random_state=1,  # Fixed seed for reproducibility
    shuffle=True,  # Ensure random sampling
)

print(f"Training set: {train.shape[0]:,} records ({train.shape[0] / len(data) * 100:.1f}%)")
print(f"Holdout set:  {holdout.shape[0]:,} records ({holdout.shape[0] / len(data) * 100:.1f}%)")

# 2. SDV Metadata Configuration

## Metadata Setup
SDV requires metadata to understand your data structure. We'll use auto-detection to identify column types (numerical vs categorical), then validate the configuration.

## Auto-Detecting Metadata
SDV can automatically detect column types from the data. The auto-detection correctly identifies our numerical and categorical columns.
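
To see why auto-detected types are still worth reviewing, here is a minimal sketch of a value-based heuristic. The function `guess_sdtype` and the toy column names are hypothetical, and this is not SDV's actual detection logic; it only shows how an integer-coded category could end up labeled as numerical.

```python
def guess_sdtype(values):
    """Hypothetical heuristic: all-numeric columns are 'numerical', the rest 'categorical'."""
    non_missing = [v for v in values if v is not None]
    is_number = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_missing
    )
    return "numerical" if non_missing and is_number else "categorical"


# Toy table with made-up column names (not the ACS schema)
toy_table = {
    "age": [39, 50, None, 28],
    "hours_per_week": [40.0, 13.5, 40.0, 60.0],
    "occupation": ["sales", "tech", "sales", None],
    "state_code": [6, 36, 48, 6],  # a category stored as integers
}
guessed = {column: guess_sdtype(values) for column, values in toy_table.items()}
print(guessed)
```

In this toy case `state_code` is guessed as numerical even though it is really a label, which is the kind of mismatch a quick look at the metadata summary can catch.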

In [None]:
from sdv.metadata import Metadata

# Auto-detect metadata from the training data
# Note: We wrap the DataFrame in a dict with table name 'table' as required by SDV
# Using only training data to avoid data leakage
metadata = Metadata.detect_from_dataframes({"table": train})

# Show a summary of detected column types
table_metadata = metadata.to_dict()["tables"]["table"]["columns"]
numerical_cols = [col for col, info in table_metadata.items() if info["sdtype"] == "numerical"]
categorical_cols = [col for col, info in table_metadata.items() if info["sdtype"] == "categorical"]

print("\n📊 Metadata Summary:")
print(f"Numerical columns ({len(numerical_cols)}): {numerical_cols}")
print(f"Categorical columns ({len(categorical_cols)}): {categorical_cols}")

# Validate the metadata structure
try:
    metadata.validate()
    print("✅ Metadata validation passed")
except Exception as e:
    print(f"❌ Metadata validation failed: {e}")
    # You would fix metadata issues here if any exist

# Validate that the metadata matches the actual data structure
try:
    metadata.validate_data(data={"table": train})  # Use train data for consistency
    print("✅ Data validation against metadata passed")
except Exception as e:
    print(f"❌ Data validation failed: {e}")
    # This would indicate mismatches between metadata and actual data

# 3. SDV: Training and Generation

## Gaussian Copula Synthesizer
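We'll use SDV's Gaussian Copula Synthesizer, which models each column's distribution, captures the dependencies between columns through a Gaussian copula (a correlation structure on normal-transformed values), and then samples new rows that preserve those relationships.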
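
The idea behind a Gaussian copula can be sketched in a few lines for two numeric columns. This is a simplified illustration under stated assumptions, not SDV's implementation: it uses empirical quantiles for the marginals, handles only two columns, and all function names and toy values are made up for the example.

```python
import random
from statistics import NormalDist

std_normal = NormalDist()


def to_normal_scores(column):
    """Map each value to a standard-normal score via its rank (empirical CDF)."""
    order = sorted(range(len(column)), key=lambda i: column[i])
    scores = [0.0] * len(column)
    for rank, i in enumerate(order):
        scores[i] = std_normal.inv_cdf((rank + 0.5) / len(column))
    return scores


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def from_normal_score(z, sorted_column):
    """Map a normal score back to a value using the column's empirical quantiles."""
    u = std_normal.cdf(z)
    position = min(int(u * len(sorted_column)), len(sorted_column) - 1)
    return sorted_column[position]


def fit_and_sample(col_a, col_b, n_samples, seed=0):
    # "Fit": estimate the correlation between the normal-transformed columns
    rho = pearson(to_normal_scores(col_a), to_normal_scores(col_b))
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        # "Sample": draw two correlated standard normals, then invert the marginals
        g1, g2 = rng.gauss(0, 1), rng.gauss(0, 1)
        z_a = g1
        z_b = rho * g1 + max(0.0, 1 - rho**2) ** 0.5 * g2
        samples.append(
            (from_normal_score(z_a, sorted_a), from_normal_score(z_b, sorted_b))
        )
    return rho, samples


# Toy data: two positively related columns
ages = [22, 25, 31, 38, 44, 52, 59, 63]
incomes = [18_000, 24_000, 30_000, 52_000, 48_000, 75_000, 71_000, 90_000]
rho, synthetic = fit_and_sample(ages, incomes, n_samples=5)
print(f"estimated normal-score correlation: {rho:.2f}")
print(synthetic)
```

The sampled pairs are new combinations of values, but high ages still tend to come with high incomes because the correlation is carried over through the normal scores.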

In [None]:
import time

from sdv.single_table import GaussianCopulaSynthesizer

# Initialize the synthesizer with our metadata
# GaussianCopula is good for mixed data types and preserving correlations
synthesizer = GaussianCopulaSynthesizer(metadata)

print("🚀 Starting SDV training...")
print(f"Training on {len(train):,} records with {len(train.columns)} features")

start_time = time.time()

# Train the synthesizer on our training data
# This learns the statistical relationships between variables
synthesizer.fit(train)

end_time = time.time()
elapsed_minutes = (end_time - start_time) / 60

print(f"✅ SDV training completed in {elapsed_minutes:.2f} minutes")

In [None]:
print("🎲 Starting SDV synthetic data generation...")

start_time = time.time()

# Generate synthetic data with the same number of rows as original dataset
# You can adjust num_rows based on your needs
target_rows = len(data)  # Generate same size as original
sdv_synthetic_data = synthesizer.sample(num_rows=target_rows)

end_time = time.time()
elapsed_minutes = (end_time - start_time) / 60

print(f"✅ SDV generation completed in {elapsed_minutes:.2f} minutes")
print(f"⏱️ Generation rate: {target_rows / (end_time - start_time):,.0f} records/second")
print(f"📊 Generated {len(sdv_synthetic_data):,} synthetic records")

# Quick preview of generated data
print("\nFirst 5 synthetic records:")
print(sdv_synthetic_data.head())

In [None]:
import os

# Save SDV synthetic data (create the output directory first if it is missing)
os.makedirs("./data", exist_ok=True)
output_file = "./data/sdv_synthetic_data.parquet"
sdv_synthetic_data.to_parquet(output_file, index=False)

# Get file size in MB
file_size_mb = os.path.getsize(output_file) / 1024**2

print(f"💾 SDV synthetic data saved to: {output_file}")
print(f"📁 File size: {file_size_mb:.1f} MB")

# 4. MOSTLY AI: Training and Generation

## Deep Learning Approach
MOSTLY AI trains a deep learning model on the tabular data. The SDK can run this training locally, with configurable parameters such as maximum training time and privacy settings.

In [None]:
from mostlyai.sdk import MostlyAI

# Initialize the MOSTLY AI SDK for local training
# local=True means we'll train models locally rather than using the cloud API
print("🔧 Initializing MOSTLY AI SDK...")
mostly = MostlyAI(local=True)
print("✅ MOSTLY AI SDK initialized successfully")

In [None]:
print("🚀 Starting MOSTLY AI training...")
print(f"Training on {len(train):,} records with {len(train.columns)} features")

start_time = time.time()

# Configure and start training
# MOSTLY AI automatically detects column types and optimizes model architecture
g = mostly.train(
    config={
        "name": "ACS Income",
        "tables": [
            {
                "name": "census",
                "data": train,
                "tabularModelConfiguration": {
                    "max_training_time": 10,  # Limit training time (minutes)
                    "enable_model_report": False,  # Skip the built-in report; QA runs separately in section 5
                },
            }
        ],
    },
)

end_time = time.time()
elapsed_minutes = (end_time - start_time) / 60

print(f"✅ MOSTLY AI training completed in {elapsed_minutes:.2f} minutes")

In [None]:
print("🎲 Starting MOSTLY AI synthetic data generation...")

start_time = time.time()

# Generate synthetic data using the trained generator
# size parameter controls how many records to generate
target_rows = len(data)
sd = mostly.generate(g, size=target_rows)
mostlyai_synthetic_data = sd.data()

end_time = time.time()
elapsed_minutes = (end_time - start_time) / 60

print(f"✅ MOSTLY AI generation completed in {elapsed_minutes:.2f} minutes")
print(f"⏱️ Generation rate: {target_rows / (end_time - start_time):,.0f} records/second")
print(f"📊 Generated {len(mostlyai_synthetic_data):,} synthetic records")

# Quick preview of generated data
print("\nFirst 5 synthetic records:")
mostlyai_synthetic_data.head()

In [None]:
# Save MOSTLY AI synthetic data for comparison
os.makedirs("./data", exist_ok=True)  # no-op if the directory already exists
output_file = "./data/mostlyai_synthetic_data.parquet"
mostlyai_synthetic_data.to_parquet(output_file, index=False)
file_size_bytes = os.path.getsize(output_file)
print(f"💾 MOSTLY AI synthetic data saved to: {output_file}")
print(f"📁 File size: {file_size_bytes / 1024**2:.1f} MB")

# 5. Quality Assessment and Comparison

## Evaluation Framework
We'll use MOSTLY AI's comprehensive [Synthetic Data Quality Assurance](https://github.com/mostly-ai/mostlyai-qa) framework to evaluate both synthetic datasets. The assessment includes:

- **Accuracy Metrics**: How well synthetic data preserves statistical distributions (univariate, bivariate, trivariate)
- **Similarity Analysis**: Comparison between training, holdout, and synthetic data
- **DCR Privacy Metrics**: Distance to Closest Record analysis for privacy assessment
- **Overall Quality Score**: Combined metric for synthetic data fidelity
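
To make the accuracy idea concrete, here is a minimal sketch for a single categorical column. It assumes accuracy is measured as one minus the total variation distance (TVD) between the original and synthetic distributions; the real framework also bins numeric columns and aggregates over many columns and column combinations, which this toy version leaves out. The function name and sample values are made up for illustration.

```python
from collections import Counter


def univariate_accuracy(original, synthetic):
    """1 - total variation distance between two categorical distributions."""
    p, q = Counter(original), Counter(synthetic)
    categories = set(p) | set(q)
    tvd = 0.5 * sum(
        abs(p[c] / len(original) - q[c] / len(synthetic)) for c in categories
    )
    return 1.0 - tvd


original = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
synthetic = ["a"] * 45 + ["b"] * 35 + ["c"] * 20
print(f"{univariate_accuracy(original, synthetic):.2f}")  # 0.95
```

Here five percentage points of mass moved from category `a` to category `b`, so the TVD is 0.05 and the accuracy is 0.95.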

### Key Privacy Metrics:
- **DCR Share**: Proportion of synthetic records that are closer to a training record than to a holdout record (values well above 0.5 suggest the model reproduces training data)
- **DCR Training**: Average distance from each synthetic record to its closest training record (higher = better privacy)
- **Optimal DCR Share**: ~0.5 indicates a good balance between utility and privacy
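
The DCR share can be illustrated on toy two-dimensional points. This sketch assumes the share counts synthetic records whose nearest training record is closer than their nearest holdout record, with ties counted as one half; the real framework works on embeddings of full records, and the function names and points below are made up for the example.

```python
import math


def nearest_distance(point, records):
    """Euclidean distance from a point to its closest record."""
    return min(math.dist(point, r) for r in records)


def dcr_share(synthetic, training, holdout):
    """Share of synthetic records closer to training than to holdout (ties count half)."""
    closer, ties = 0, 0
    for s in synthetic:
        d_trn = nearest_distance(s, training)
        d_hol = nearest_distance(s, holdout)
        closer += d_trn < d_hol
        ties += d_trn == d_hol
    return (closer + ties / 2) / len(synthetic)


training = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
holdout = [(0.5, 0.5), (1.5, 1.5), (2.5, 2.5)]

memorized = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]           # exact copies of training rows
balanced = [(0.1, 0.1), (1.4, 1.4), (2.1, 2.1), (0.6, 0.6)]  # half near each set

print(dcr_share(memorized, training, holdout))  # 1.0
print(dcr_share(balanced, training, holdout))   # 0.5
```

A generator that copies training rows pushes the share toward 1.0, while one that has learned only general patterns lands near 0.5, since training and holdout rows come from the same distribution.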

Let's compare the results from both frameworks:

In [None]:
# Import and initialize the quality assessment framework
from mostlyai import qa

# Initialize logging to see detailed evaluation progress
qa.init_logging()
print("🔍 Quality assessment framework initialized")

In [None]:
print("📊 Evaluating SDV synthetic data quality...")

# Load the SDV synthetic dataset
sdv_synthetic_data = pd.read_parquet("./data/sdv_synthetic_data.parquet")

# Run comprehensive quality assessment
# This compares synthetic data against training and holdout sets
report_path, metrics = qa.report(
    syn_tgt_data=sdv_synthetic_data,  # SDV synthetic data
    trn_tgt_data=train,  # Original training data
    hol_tgt_data=holdout,  # Holdout data for validation
    max_sample_size_embeddings=10_000,  # Limit sample size for efficiency
    report_path="sdv_qa_report.html",  # HTML report output
)

print(f"📋 SDV Quality Report saved to: {report_path}")
print("\n📈 SDV Quality Metrics:")
print(metrics.model_dump_json(indent=4))

# Extract key metrics for comparison
sdv_accuracy = metrics.accuracy.overall
sdv_dcr_share = metrics.distances.dcr_share
sdv_dcr_training = metrics.distances.dcr_training
print("\n🎯 SDV Summary:")
print(f"   Overall Accuracy: {sdv_accuracy:.3f}")
print(f"   DCR Share: {sdv_dcr_share:.3f} (~0.5 is ideal; well above 0.5 signals privacy risk)")
print(f"   DCR Training: {sdv_dcr_training:.3f} (higher is better for privacy)")

In [None]:
print("📊 Evaluating MOSTLY AI synthetic data quality...")

# Load the MOSTLY AI synthetic dataset
mostlyai_synthetic_data = pd.read_parquet("./data/mostlyai_synthetic_data.parquet")

# Run comprehensive quality assessment for MOSTLY AI
report_path, metrics = qa.report(
    syn_tgt_data=mostlyai_synthetic_data,  # MOSTLY AI synthetic data
    trn_tgt_data=train,  # Original training data
    hol_tgt_data=holdout,  # Holdout data for validation
    max_sample_size_embeddings=10_000,  # Limit sample size for efficiency
    report_path="mostlyai_qa_report.html",  # HTML report output
)

print(f"📋 MOSTLY AI Quality Report saved to: {report_path}")
print("\n📈 MOSTLY AI Quality Metrics:")
print(metrics.model_dump_json(indent=4))

# Extract key metrics for comparison
mai_accuracy = metrics.accuracy.overall
mai_dcr_share = metrics.distances.dcr_share
mai_dcr_training = metrics.distances.dcr_training
print("\n🎯 MOSTLY AI Summary:")
print(f"   Overall Accuracy: {mai_accuracy:.3f}")
print(f"   DCR Share: {mai_dcr_share:.3f} (~0.5 is ideal; well above 0.5 signals privacy risk)")
print(f"   DCR Training: {mai_dcr_training:.3f} (higher is better for privacy)")

In [None]:
# Add a final comparison section
print("\n" + "=" * 60)
print("🏆 FINAL COMPARISON")
print("=" * 60)
print(f"SDV       - Accuracy: {sdv_accuracy:.3f}, DCR Share: {sdv_dcr_share:.3f}")
print(f"MOSTLY AI - Accuracy: {mai_accuracy:.3f}, DCR Share: {mai_dcr_share:.3f}")
print("\nInterpretation:")
print("• Higher accuracy = better statistical fidelity")
print("• DCR Share well above 0.5 = synthetic records sit closer to training data (privacy risk)")
print("• DCR Share ~0.5 indicates good balance between utility and privacy")
print("• Check HTML reports for detailed analysis")