{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0ff85e19-73bb-4319-a76a-0209382fff1e",
   "metadata": {},
   "source": [
    "# Star Schema Correlation Analysis Tutorial <a href=\"https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/star-schema-correlations/star-schema-correlations.ipynb\" target=\"_blank\"><img src=\"https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab\" alt=\"Run on Colab\"></a>\n",
    "\n",
    "This tutorial demonstrates how our synthetic data generation tool preserves hidden correlations in a **star schema**. The **Players** table serves as the central entity, linking the **Batting** and **Fielding** tables through the foreign key `players_id`. While these tables are not directly connected, they share implicit relationships through common attributes such as player, year, and team.\n",
    "\n",
    "In this tutorial, we will:\n",
    "1. Analyze the correlation between **Batting** and **Fielding** statistics in the original dataset.\n",
    "2. Train a synthetic data generator using the three related tables: **Players, Batting, and Fielding**.\n",
    "3. Demonstrate that training the generator with both related tables together allows it to **implicitly retain correlations** between them, even though they are not directly linked.\n",
    "\n",
    "This approach highlights how our tool **maintains data consistency across related tables**, ensuring synthetic data preserves meaningful statistical relationships.\n",
    "\n",
    "<img src='https://raw.githubusercontent.com/mostly-ai/mostlyai/main/docs/tutorials/star-schema-correlations/baseball_table_relationships.png' width=\"600px\"/>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "23c274e1-e576-4aca-9122-eac64cd66f39",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install -U mostlyai  # or: pip install -U 'mostlyai[local]'\n",
    "%pip install seaborn matplotlib"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6ef94864-276a-4445-a297-abf7d55a3f38",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e66b2a9c-18e1-4ee6-8c2a-34ad5eba706c",
   "metadata": {},
   "source": [
    "## Step 1: Load and Filter the Data\n",
    "\n",
    "We begin by loading the datasets. The **Players** table serves as the **central entity**, linking the **Batting** and **Fielding** tables through the foreign key `players_id`. These tables contain player performance statistics.\n",
    "\n",
    "To ensure that our analysis focuses on **modern-era players**, we filter the data to retain only players who **played exclusively after 1945**. This ensures that no player with pre-1945 data is included, keeping the dataset relevant for analysis.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e9703126-bbbc-4430-a8ae-11d8739ad30b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the datasets\n",
    "players_url = \"https://github.com/mostly-ai/public-demo-data/raw/refs/heads/dev/baseball/players.csv.gz\"\n",
    "batting_url = \"https://github.com/mostly-ai/public-demo-data/raw/refs/heads/dev/baseball/batting.csv.gz\"\n",
    "fielding_url = \"https://github.com/mostly-ai/public-demo-data/raw/refs/heads/dev/baseball/fielding.csv.gz\"\n",
    "\n",
    "players_original = pd.read_csv(players_url, compression=\"gzip\", low_memory=False)\n",
    "batting_original = pd.read_csv(batting_url, compression=\"gzip\", low_memory=False)\n",
    "fielding_original = pd.read_csv(fielding_url, compression=\"gzip\", low_memory=False)\n",
    "\n",
    "# Filter data first, keeping only records from post-1945\n",
    "batting_filtered = batting_original[batting_original[\"year\"] > 1945]\n",
    "fielding_filtered = fielding_original[fielding_original[\"year\"] > 1945]\n",
    "\n",
    "# Identify players who played only after 1945\n",
    "valid_batting_players = batting_filtered.groupby(\"players_id\")[\"year\"].min() > 1945\n",
    "valid_fielding_players = fielding_filtered.groupby(\"players_id\")[\"year\"].min() > 1945\n",
    "\n",
    "# Find common players\n",
    "valid_players = valid_batting_players.index.intersection(valid_fielding_players.index)\n",
    "\n",
    "# Filter all datasets\n",
    "players_df = players_original[players_original[\"id\"].isin(valid_players)]\n",
    "batting_df = batting_filtered[batting_filtered[\"players_id\"].isin(valid_players)]\n",
    "fielding_df = fielding_filtered[fielding_filtered[\"players_id\"].isin(valid_players)]\n",
    "\n",
    "# Display the filtered datasets\n",
    "players_df.head(), batting_df.head(), fielding_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d3c0fcf7-2bfb-4f81-82b6-d35806975f6b",
   "metadata": {},
   "source": [
    "## Step 2: Merge the Tables\n",
    "\n",
    "To analyze correlations, we merge the **Batting** and **Fielding** tables using the **Players** table as a bridge. Since **batting and fielding are not directly related**, we connect them through their common attributes: `players_id`, `year`, and `team`.\n",
    "\n",
    "- First, we merge **batting statistics** with player details.\n",
    "- Next, we **aggregate fielding statistics**, summing numeric columns (excluding categorical fields).\n",
    "- Finally, we merge **batting and fielding data** using an **outer join**, ensuring that all records are retained even if some statistics are missing.\n",
    "- Missing numerical values from the **outer join** are set to **0** to maintain data consistency.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5f720d49-a2f6-492c-b2eb-b02abd6b2bbc",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Merge batting data with player details\n",
    "batting_with_players = pd.merge(batting_df, players_df, left_on=\"players_id\", right_on=\"id\", how=\"inner\")\n",
    "\n",
    "# Aggregate fielding statistics, summing only numeric columns\n",
    "numeric_cols = fielding_df.select_dtypes(include=[\"number\"]).columns.difference([\"players_id\", \"year\", \"team\"])\n",
    "fielding_agg = fielding_df.groupby([\"players_id\", \"year\", \"team\"])[numeric_cols].sum().reset_index()\n",
    "\n",
    "# Merge batting and fielding data using an outer join\n",
    "batting_fielding = pd.merge(batting_with_players, fielding_agg, on=[\"players_id\", \"year\", \"team\"], how=\"outer\")\n",
    "\n",
    "# Get the actual numeric columns that exist in batting_fielding after merging\n",
    "existing_numeric_cols = batting_fielding.select_dtypes(include=[\"number\"]).columns\n",
    "\n",
    "# Set NaN numerical values to 0 only for existing columns\n",
    "batting_fielding[existing_numeric_cols] = batting_fielding[existing_numeric_cols].fillna(0)\n",
    "\n",
    "# Display merged table\n",
    "print(\"\\nCombined Batting and Fielding:\")\n",
    "print(batting_fielding.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aa7f0969-f543-44bc-b40f-7b5462ed7600",
   "metadata": {},
   "source": [
    "## Step 3: Correlation Analysis\n",
    "\n",
    "With the merged dataset, we compute the correlation matrix to analyze relationships between **Batting** and **Fielding** statistics. This helps us understand whether certain performance metrics are inherently related in the original data.\n",
    "\n",
    "### Key Steps:\n",
    "1. Reorder the columns to separate **player information**, **batting statistics**, and **fielding statistics**.\n",
    "2. Select **only numeric columns** for correlation analysis.\n",
    "3. Compute and visualize the **correlation matrix** using a heatmap.\n",
    "4. Add visual separators to distinguish correlations **within** and **between** batting and fielding features.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "37d5a8fa-af80-4f8c-b788-ca4ec5286c4b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Reorder columns: players, then batting, then fielding\n",
    "ordered_columns = [\n",
    "    \"id\",\n",
    "    \"country\",\n",
    "    \"birthDate\",\n",
    "    \"deathDate\",\n",
    "    \"nameFirst\",\n",
    "    \"nameLast\",\n",
    "    \"weight\",\n",
    "    \"height\",\n",
    "    \"bats\",\n",
    "    \"throws\",\n",
    "    \"year\",\n",
    "    \"team\",\n",
    "    \"league\",\n",
    "    \"G_x\",\n",
    "    \"AB\",\n",
    "    \"R\",\n",
    "    \"H\",\n",
    "    \"HR\",\n",
    "    \"RBI\",\n",
    "    \"SB\",\n",
    "    \"CS\",\n",
    "    \"BB\",\n",
    "    \"SO\",\n",
    "    \"G_y\",\n",
    "    \"GS\",\n",
    "    \"InnOuts\",\n",
    "    \"PO\",\n",
    "    \"A\",\n",
    "    \"E\",\n",
    "    \"DP\",\n",
    "]\n",
    "\n",
    "# Ensure the columns are in the correct order\n",
    "batting_fielding_ordered = batting_fielding[ordered_columns]\n",
    "\n",
    "# Select only numeric columns for correlation\n",
    "numeric_columns = batting_fielding_ordered.select_dtypes(include=[\"float64\", \"int64\"]).columns\n",
    "corr_matrix = batting_fielding_ordered[numeric_columns].corr()\n",
    "\n",
    "# Plot the correlation matrix with values\n",
    "plt.figure(figsize=(14, 10))\n",
    "sns.heatmap(\n",
    "    corr_matrix, annot=True, fmt=\".2f\", cmap=\"coolwarm\", cbar=True, square=True, linewidths=0.5, vmin=-1, vmax=1\n",
    ")\n",
    "\n",
    "# Add lines to separate the original tables\n",
    "plt.axvline(x=2, color=\"black\", linewidth=2)  # After players columns\n",
    "plt.axvline(x=13, color=\"black\", linewidth=2)  # After batting columns\n",
    "plt.axhline(y=2, color=\"black\", linewidth=2)  # After players columns\n",
    "plt.axhline(y=13, color=\"black\", linewidth=2)  # After batting columns\n",
    "plt.show()  # Select only numeric columns and fill NaNs with 0\n",
    "numeric_columns = batting_fielding_ordered.select_dtypes(include=[\"float64\", \"int64\"]).columns\n",
    "\n",
    "# Fill NaN values in numeric columns explicitly using .loc\n",
    "batting_fielding_ordered.loc[:, numeric_columns] = batting_fielding_ordered[numeric_columns].fillna(0)\n",
    "\n",
    "# Compute correlation matrix\n",
    "corr_matrix = batting_fielding_ordered[numeric_columns].corr()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "154a2160-da00-4af8-a3c2-645f69282896",
   "metadata": {},
   "source": [
    "## Step 4: Batting vs. Fielding Correlation Analysis\n",
    "\n",
    "Now that we have computed the full correlation matrix, we focus specifically on **the relationships between batting and fielding statistics**. Since the **Batting** and **Fielding** tables are not directly connected but share common attributes (`players_id`, `year`, `team`), this step helps us observe how performance in one area correlates with the other.\n",
    "\n",
    "### Key Steps:\n",
    "1. **Select relevant columns** from the Batting and Fielding tables.\n",
    "2. **Compute correlations** between each batting and fielding statistic.\n",
    "3. **Visualize the correlation matrix** to highlight key relationships.\n",
    "\n",
    "This analysis helps verify whether batting and fielding statistics exhibit meaningful correlations in the real dataset, which will later be compared with synthetic data to assess the preservation of hidden relationships.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4cea2b39-8444-4785-93a2-b1bbffaf251b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define the columns from the batting and fielding tables\n",
    "batting_columns = [\"G_x\", \"AB\", \"R\", \"H\", \"HR\", \"RBI\", \"SB\", \"CS\", \"BB\", \"SO\"]\n",
    "fielding_columns = [\"G_y\", \"GS\", \"InnOuts\", \"PO\", \"A\", \"E\", \"DP\"]\n",
    "\n",
    "# Create a DataFrame to store the correlation values\n",
    "corr_df = pd.DataFrame(index=batting_columns, columns=fielding_columns)\n",
    "\n",
    "# Calculate the correlation for each pair of columns\n",
    "for b_col in batting_columns:\n",
    "    for f_col in fielding_columns:\n",
    "        corr_df.loc[b_col, f_col] = batting_fielding[b_col].corr(batting_fielding[f_col])\n",
    "\n",
    "# Plot the correlation matrix\n",
    "plt.figure(figsize=(10, 8))\n",
    "sns.heatmap(corr_df.astype(float), annot=True, cmap=\"coolwarm\", cbar=True, linewidths=0.5, vmin=-1, vmax=1)\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c94848b6-99f1-48d5-b630-2d12fe53ed6c",
   "metadata": {},
   "source": [
    "## Step 5: Define Generator Configurations\n",
    "\n",
    "Now that we have analyzed correlations in the real dataset, we define the generator configuration for training on both **Batting** and **Fielding** tables simultaneously. This setup enables the synthetic data generator to automatically learn and maintain correlations between the two related tables.\n",
    "\n",
    "### Key Configuration Details:\n",
    "- **Players Table** acts as the central reference table, with `id` as its primary key.\n",
    "- **Batting Table** and **Fielding Table** reference the `players_id` foreign key, establishing their connection to the **Players Table**.\n",
    "- Both related tables are configured with `\"is_context\": True`, ensuring that player-level information is used when generating synthetic data.\n",
    "- The generator is set with a maximum training time of **30 minutes** and **value protection disabled** for unrestricted data generation.\n",
    "\n",
    "This step ensures that when synthetic data is generated, the correlations between **Batting** and **Fielding** statistics are preserved.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d3b70f21-d6ef-4b37-a332-6e939972b0fd",
   "metadata": {},
   "outputs": [],
   "source": [
    "from mostlyai.sdk import MostlyAI\n",
    "\n",
    "# initialize SDK\n",
    "mostly = MostlyAI()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eb8ca249-fdf9-4646-8908-9d3313b52565",
   "metadata": {},
   "outputs": [],
   "source": [
    "players_table_config = {\n",
    "    \"name\": \"players\",\n",
    "    \"data\": players_df,\n",
    "    \"tabular_model_configuration\": {\n",
    "        \"max_training_time\": 30,\n",
    "        \"value_protection\": False,\n",
    "        \"enable_flexible_generation\": False,\n",
    "    },\n",
    "    \"primary_key\": \"id\",\n",
    "}\n",
    "\n",
    "batting_table_config = {\n",
    "    \"name\": \"batting\",\n",
    "    \"data\": batting_df,\n",
    "    \"tabular_model_configuration\": {\n",
    "        \"max_training_time\": 30,\n",
    "        \"value_protection\": False,\n",
    "        \"enable_flexible_generation\": False,\n",
    "    },\n",
    "    \"foreign_keys\": [{\"column\": \"players_id\", \"referenced_table\": \"players\", \"is_context\": True}],\n",
    "}\n",
    "\n",
    "fielding_table_config = {\n",
    "    \"name\": \"fielding\",\n",
    "    \"data\": fielding_agg,\n",
    "    \"tabular_model_configuration\": {\n",
    "        \"max_training_time\": 30,\n",
    "        \"value_protection\": False,\n",
    "        \"enable_flexible_generation\": False,\n",
    "    },\n",
    "    \"foreign_keys\": [{\"column\": \"players_id\", \"referenced_table\": \"players\", \"is_context\": True}],\n",
    "}\n",
    "\n",
    "generator_config = {\n",
    "    \"name\": \"Multi-table Correlation Tutorial - Baseball Player->Batting,Fielding Generator\",\n",
    "    \"tables\": [players_table_config, batting_table_config, fielding_table_config],\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "617f58fb-1924-4d91-bec3-fc0e0cbf7f4f",
   "metadata": {},
   "source": [
    "## Step 6: Training the Generators\n",
    "\n",
    "After defining the generator configurations, we now proceed with training the synthetic data generator. This step enables the model to learn patterns and correlations from the **Players, Batting, and Fielding** tables, ensuring that the relationships observed in the real data are preserved in the generated synthetic data.\n",
    "\n",
    "The generator will:\n",
    "- Learn **player-specific** characteristics from the **Players** table.\n",
    "- Capture **batting and fielding statistics** while maintaining their correlations through the shared **players_id** key.\n",
    "- Automatically manage dependencies between **Batting** and **Fielding**, ensuring a realistic multi-table generation process.\n",
    "\n",
    "Once the training process is complete, the generator will be ready to create synthetic datasets that retain the hidden correlations between these tables.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dc29182c-08a2-4e3c-97f6-468b50915130",
   "metadata": {},
   "outputs": [],
   "source": [
    "# training the full_generator\n",
    "generator = mostly.train(config=generator_config)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "67632574-8930-46b7-ab97-902e0c3c688a",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"{generator.id} {generator.name} - {generator.accuracy}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ae227d7f-09b4-46be-b357-1717ce6dbb7a",
   "metadata": {},
   "source": [
    "## Step 7: Generate Synthetic Data Using the Original Players as Seed\n",
    "\n",
    "To ensure a meaningful comparison between real and synthetic correlations, we generate **synthetic Batting and Fielding tables** while using the **original Players table as a seed**. This process, known as **conditional generation**, ensures that the synthetic data retains the same player distribution as the real dataset.\n",
    "\n",
    "### **Why Use the Original Players as Seed?**\n",
    "- It guarantees that synthetic **batting and fielding statistics** are generated **for the same set of players** as in the real dataset.\n",
    "- It allows for a **direct comparison** between the original and synthetic correlation matrices.\n",
    "\n",
    "With this approach, we can later assess how well the synthetic dataset preserves the relationships between Batting and Fielding.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "45a4b86e-bc17-41ef-8f18-814a16868c45",
   "metadata": {},
   "outputs": [],
   "source": [
    "synthetic_dataset = mostly.generate(generator, seed={\"players\": players_df})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "56b91897-4963-4aea-ba64-1c59dd62bb1d",
   "metadata": {},
   "source": [
    "## Step 8: Merging the Synthetic Batting and Fielding Tables\n",
    "\n",
    "After generating synthetic data, we need to merge the **synthetic Batting and Fielding tables** in the same way as we did with the real data. This ensures that our synthetic dataset follows the same structure, allowing for a direct comparison of correlations.\n",
    "\n",
    "### Key Steps:\n",
    "1. Extract **synthetic** Players, Batting, and Fielding tables from the generated dataset.\n",
    "2. Merge **Batting** with **Players** using `players_id` as the foreign key.\n",
    "3. Merge the resulting dataset with **Fielding**, using an **outer join** to retain all records.\n",
    "\n",
    "By structuring the synthetic data identically to the real dataset, we can later perform a correlation analysis and assess how well the synthetic data preserves relationships between Batting and Fielding.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3988d207-2d10-46d5-aae1-036642d1d285",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Getting the dataframes from the synthetic dataset\n",
    "players_df_syn = synthetic_dataset.data()[\"players\"]  # same as the seed used\n",
    "batting_df_syn = synthetic_dataset.data()[\"batting\"]\n",
    "fielding_df_syn = synthetic_dataset.data()[\"fielding\"]\n",
    "\n",
    "# Merge sequence tables with the flat table\n",
    "batting_with_players_syn = pd.merge(batting_df_syn, players_df_syn, left_on=\"players_id\", right_on=\"id\", how=\"inner\")\n",
    "\n",
    "# Join batting and fielding through the players table\n",
    "batting_fielding_syn = pd.merge(\n",
    "    batting_with_players_syn, fielding_df_syn, on=[\"players_id\", \"year\", \"team\"], how=\"outer\"\n",
    ")\n",
    "\n",
    "# Display merged tables\n",
    "print(\"\\nCombined Synthetic Batting and Fielding:\")\n",
    "print(batting_fielding_syn.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "813302f1-5fdb-4d37-ab36-19a050b865ff",
   "metadata": {},
   "source": [
    "## Step 9: Comparing Correlations Between Real and Synthetic Data\n",
    "\n",
    "Now that we have merged the synthetic **Batting** and **Fielding** tables, we analyze the correlation matrix for the synthetic dataset and compare it with the original data.\n",
    "\n",
    "### Key Steps:\n",
    "1. **Compute the correlation matrix** for the synthetic dataset.\n",
    "2. **Visualize the synthetic correlation matrix** to check if the statistical relationships between batting and fielding are maintained.\n",
    "3. **Compare side-by-side** the zoomed-in correlation matrices between Batting and Fielding for both **real and synthetic data**.\n",
    "\n",
    "This comparison helps us evaluate how well the generator preserved the **hidden correlations** in the synthetic dataset.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4f7e8fe7-8618-493d-bc2d-061ea58b4a7a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Reorder columns: players, then batting, then fielding\n",
    "ordered_columns = [\n",
    "    \"id\",\n",
    "    \"country\",\n",
    "    \"birthDate\",\n",
    "    \"deathDate\",\n",
    "    \"nameFirst\",\n",
    "    \"nameLast\",\n",
    "    \"weight\",\n",
    "    \"height\",\n",
    "    \"bats\",\n",
    "    \"throws\",\n",
    "    \"year\",\n",
    "    \"team\",\n",
    "    \"league_y\",\n",
    "    \"G_x\",\n",
    "    \"AB\",\n",
    "    \"R\",\n",
    "    \"H\",\n",
    "    \"HR\",\n",
    "    \"RBI\",\n",
    "    \"SB\",\n",
    "    \"CS\",\n",
    "    \"BB\",\n",
    "    \"SO\",\n",
    "    \"G_y\",\n",
    "    \"GS\",\n",
    "    \"InnOuts\",\n",
    "    \"PO\",\n",
    "    \"A\",\n",
    "    \"E\",\n",
    "    \"DP\",\n",
    "]\n",
    "\n",
    "# Ensure the columns are in the correct order\n",
    "batting_fielding_syn_ordered = batting_fielding_syn[ordered_columns]\n",
    "\n",
    "# Select only numeric columns for correlation\n",
    "numeric_columns = batting_fielding_syn_ordered.select_dtypes(include=[\"float64\", \"int64\"]).columns\n",
    "corr_matrix_syn = batting_fielding_syn_ordered[numeric_columns].corr()\n",
    "\n",
    "# Plot the correlation matrix with values\n",
    "plt.figure(figsize=(14, 10))\n",
    "sns.heatmap(\n",
    "    corr_matrix_syn, annot=True, fmt=\".2f\", cmap=\"coolwarm\", cbar=True, square=True, linewidths=0.5, vmin=-1, vmax=1\n",
    ")\n",
    "\n",
    "# Add lines to separate the original tables\n",
    "plt.axvline(x=2, color=\"black\", linewidth=2)  # After players columns\n",
    "plt.axvline(x=13, color=\"black\", linewidth=2)  # After batting columns\n",
    "plt.axhline(y=2, color=\"black\", linewidth=2)  # After players columns\n",
    "plt.axhline(y=13, color=\"black\", linewidth=2)  # After batting columns\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7589775c-2a5b-4db2-b685-28f9cce5597e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract only Batting vs. Fielding correlations for both real and synthetic data\n",
    "corr_df_real = corr_matrix.loc[batting_columns, fielding_columns]\n",
    "corr_df_syn = corr_matrix_syn.loc[batting_columns, fielding_columns]\n",
    "\n",
    "# Create a side-by-side plot\n",
    "fig, axes = plt.subplots(1, 2, figsize=(16, 8))\n",
    "\n",
    "# Plot real data correlations\n",
    "sns.heatmap(\n",
    "    corr_df_real.astype(float), annot=True, cmap=\"coolwarm\", cbar=True, linewidths=0.5, vmin=-1, vmax=1, ax=axes[0]\n",
    ")\n",
    "axes[0].set_title(\"Real Data: Batting vs. Fielding Correlations\")\n",
    "\n",
    "# Plot synthetic data correlations\n",
    "sns.heatmap(\n",
    "    corr_df_syn.astype(float), annot=True, cmap=\"coolwarm\", cbar=True, linewidths=0.5, vmin=-1, vmax=1, ax=axes[1]\n",
    ")\n",
    "axes[1].set_title(\"Synthetic Data: Batting vs. Fielding Correlations\")\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "67a95a37-8d48-44f2-914f-7f6172a1b0cb",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "In this tutorial, we demonstrated how to analyze **hidden correlations** between two related tables (**Batting** and **Fielding**) that are not directly linked but share common attributes. We followed a structured approach:\n",
    "\n",
    "1. **Explored correlations in real data** by merging Batting and Fielding through the **Players** table and computing a correlation matrix.\n",
    "2. **Trained a synthetic data generator** using the original tables while ensuring dependencies were properly modeled.\n",
    "3. **Generated synthetic data** with the same set of players to maintain consistency.\n",
    "4. **Merged synthetic Batting and Fielding tables** and analyzed their correlation matrix.\n",
    "5. **Compared real and synthetic correlations** to assess how well the generator preserved hidden relationships.\n",
    "\n",
    "### **Key Takeaways**\n",
    "- The **Players table** acts as a bridge, allowing correlations between Batting and Fielding to emerge naturally.\n",
    "- Training the generator with **both Batting and Fielding tables together** helps maintain these correlations.\n",
    "- The **side-by-side comparison** of correlation matrices allows us to measure how well the synthetic data preserves relationships found in the original dataset.\n",
    "- Synthetic data can successfully retain complex dependencies, making it useful for **privacy-preserving data analysis**.\n",
    "\n",
    "### **Next Steps**\n",
    "- **Validate correlations statistically**: Beyond visual analysis, we can quantify correlation differences using metrics such as **mean absolute correlation difference (MACD)**.\n",
    "- **Apply to other domains**: The same methodology can be applied to financial data, healthcare records, or customer behavior analysis to ensure synthetic data maintains key statistical properties.\n",
    "\n",
    "By following this process, we can ensure that synthetic data is **both privacy-safe and statistically valuable**, unlocking new possibilities for AI-driven data exploration.\n",
    "\n",
    "🚀 **Now it's your turn!** Try modifying the dataset or generator settings to see how correlations change and adapt to different data structures.  \n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}