API Reference

Engine Reference

mostlyai.engine.split

split(
    tgt_data,
    *,
    ctx_data=None,
    tgt_primary_key=None,
    ctx_primary_key=None,
    tgt_context_key=None,
    model_type=None,
    tgt_encoding_types=None,
    ctx_encoding_types=None,
    n_partitions=1,
    workspace_dir="engine-ws",
    update_progress=None
)

Splits the provided original data into training and validation sets, and stores these as partitioned Parquet files. This is a simplified version of mostlyai-data, tailored towards single- and two-table use cases, while requiring all data to be passed as DataFrames in memory.

Creates the following folder structure within the workspace_dir:

  • OriginalData/tgt-data: Partitioned target data files.
  • OriginalData/tgt-meta: Metadata files for target data.
  • OriginalData/ctx-data: Partitioned context data files (if context is provided).
  • OriginalData/ctx-meta: Metadata files for context data (if context is provided).

Parameters:

  • tgt_data (DataFrame, required): DataFrame containing the target data.
  • ctx_data (DataFrame | None, default None): DataFrame containing the context data.
  • tgt_primary_key (str | None, default None): Primary key column name in the target data.
  • ctx_primary_key (str | None, default None): Primary key column name in the context data.
  • tgt_context_key (str | None, default None): Context key column name in the target data.
  • model_type (str | ModelType | None, default None): Model type for the target data. If not provided, it will be inferred from the encoding types, or set to TABULAR by default.
  • tgt_encoding_types (dict[str, str | ModelEncodingType] | None, default None): Encoding types for columns in the target data (excluding key columns).
  • ctx_encoding_types (dict[str, str | ModelEncodingType] | None, default None): Encoding types for columns in the context data (excluding key columns).
  • n_partitions (int, default 1): Number of partitions to split the data into.
  • workspace_dir (str | Path, default 'engine-ws'): Path to the workspace directory where files will be created.
  • update_progress (ProgressCallback | None, default None): A custom progress callback.
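For a two-table setup, the context and target frames must be linkable via the chosen keys before calling split. A minimal sketch in pandas (the customer/purchase tables and all column names are illustrative, not part of the API):

```python
import pandas as pd

# Illustrative context ("customers") and target ("purchases") tables
ctx = pd.DataFrame({
    "customer_id": [1, 2, 3],            # would serve as ctx_primary_key
    "country": ["AT", "DE", "AT"],
})
tgt = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],         # would serve as tgt_context_key
    "amount": [9.99, 4.50, 12.00, 3.25],
})

# Sanity check: every target row must reference an existing context row
assert tgt["customer_id"].isin(ctx["customer_id"]).all()
```

These frames would then be passed as split(tgt_data=tgt, ctx_data=ctx, ctx_primary_key="customer_id", tgt_context_key="customer_id").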

mostlyai.engine.analyze

analyze(
    *,
    value_protection=True,
    workspace_dir="engine-ws",
    update_progress=None
)

Generates (privacy-safe) column-level statistics of the original data that has been split into the workspace. This information is required for encoding the original data, as well as for decoding the generated data.

Creates the following folder structure within the workspace_dir:

  • ModelStore/tgt-stats/stats.json: Column-level statistics for target data.
  • ModelStore/ctx-stats/stats.json: Column-level statistics for context data (if context is provided).

Parameters:

  • value_protection (bool, default True): Whether to enable value protection for rare values.
  • workspace_dir (str | Path, default 'engine-ws'): Path to workspace directory containing partitioned data.
  • update_progress (ProgressCallback | None, default None): Optional callback to update progress during analysis.
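The resulting statistics files can be inspected directly. A minimal sketch, assuming the default workspace_dir and that analyze has already been run (the internal structure of stats.json is not specified here):

```python
import json
from pathlib import Path

ws = Path("engine-ws")
tgt_stats_path = ws / "ModelStore" / "tgt-stats" / "stats.json"

# Only attempt to read once analyze() has populated the workspace
if tgt_stats_path.exists():
    tgt_stats = json.loads(tgt_stats_path.read_text())  # column-level statistics
```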

mostlyai.engine.encode

encode(*, workspace_dir='engine-ws', update_progress=None)

Encodes data in the workspace that has already been split and analyzed.

Creates the following folder structure within the workspace_dir:

  • OriginalData/encoded-data: Encoded data for training, stored as parquet files.

Parameters:

  • workspace_dir (str | Path, default 'engine-ws'): Directory path for workspace.
  • update_progress (ProgressCallback | None, default None): Callback for progress updates.

mostlyai.engine.train

train(
    *,
    model=None,
    max_training_time=14400.0,
    max_epochs=100.0,
    batch_size=None,
    gradient_accumulation_steps=None,
    enable_flexible_generation=True,
    max_sequence_window=None,
    differential_privacy=None,
    model_state_strategy=ModelStateStrategy.reset,
    device=None,
    workspace_dir="engine-ws",
    update_progress=None,
    upload_model_data_callback=None
)

Trains a model with optional early stopping and differential privacy.

Creates the following folder structure within the workspace_dir:

  • ModelStore: Trained model checkpoints and logs.

Parameters:

  • model (str | None, default None): The identifier of the model to train. If tabular, defaults to MOSTLY_AI/Medium. If language, defaults to MOSTLY_AI/LSTMFromScratch-3m.
  • max_training_time (float, default 14400.0): Maximum training time in minutes.
  • max_epochs (float, default 100.0): Maximum number of training epochs.
  • batch_size (int | None, default None): Per-device batch size for training and validation. If None, determined automatically.
  • gradient_accumulation_steps (int | None, default None): Number of steps to accumulate gradients. If None, determined automatically.
  • enable_flexible_generation (bool, default True): Whether to enable flexible order generation.
  • max_sequence_window (int | None, default None): Maximum sequence window. Only applicable for tabular sequential models.
  • differential_privacy (DifferentialPrivacyConfig | dict | None, default None): Configuration for differential privacy training. If None, DP is disabled.
  • model_state_strategy (ModelStateStrategy, default reset): Strategy for handling existing model state (reset/resume/reuse).
  • device (device | str | None, default None): Device to run training on ('cuda' or 'cpu'). Defaults to 'cuda' if available, else 'cpu'.
  • workspace_dir (str | Path, default 'engine-ws'): Directory path for workspace. Training outputs are stored in the ModelStore subdirectory.
  • update_progress (ProgressCallback | None, default None): Callback function to report training progress.
  • upload_model_data_callback (Callable | None, default None): Callback function to upload model data during training.

mostlyai.engine.generate

generate(
    *,
    ctx_data=None,
    seed_data=None,
    sample_size=None,
    batch_size=None,
    sampling_temperature=1.0,
    sampling_top_p=1.0,
    device=None,
    rare_category_replacement_method=None,
    rebalancing=None,
    imputation=None,
    fairness=None,
    workspace_dir="engine-ws",
    update_progress=None
)

Generates synthetic data from a trained model.

Creates the following folder structure within the workspace_dir:

  • SyntheticData: Generated synthetic data, stored as parquet files.

Parameters:

  • ctx_data (DataFrame | None, default None): Context data to be used for generation.
  • seed_data (DataFrame | None, default None): Seed data to condition generation on fixed target columns.
  • sample_size (int | None, default None): Number of samples to generate. Defaults to the number of original samples.
  • batch_size (int | None, default None): Batch size for generation. If None, determined automatically.
  • sampling_temperature (float, default 1.0): Sampling temperature. Higher values increase randomness.
  • sampling_top_p (float, default 1.0): Nucleus sampling probability threshold.
  • device (str | None, default None): Device to run generation on ('cuda' or 'cpu'). Defaults to 'cuda' if available, else 'cpu'.
  • rare_category_replacement_method (RareCategoryReplacementMethod | str | None, default None): Method for handling rare categories. Only applicable for tabular models.
  • rebalancing (RebalancingConfig | dict | None, default None): Configuration for rebalancing column distributions. Only applicable for tabular models.
  • imputation (ImputationConfig | dict | None, default None): Configuration listing the columns in which missing values are to be imputed. Only applicable for tabular models.
  • fairness (FairnessConfig | dict | None, default None): Configuration for fairness constraints. Only applicable for tabular models.
  • workspace_dir (str | Path, default 'engine-ws'): Directory path for workspace.
  • update_progress (ProgressCallback | None, default None): Callback for progress updates.
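Taken together, the five engine steps form a linear pipeline over a shared workspace directory. A minimal sketch, assuming the mostlyai-engine package is installed (the example data, column names, and the short max_training_time are illustrative only):

```python
import pandas as pd
from mostlyai import engine

# Illustrative single-table target data
tgt = pd.DataFrame({
    "age": [25, 31, 47, 52],
    "city": ["Vienna", "Berlin", "Graz", "Linz"],
})

engine.split(tgt_data=tgt, workspace_dir="engine-ws")       # partition the data
engine.analyze(workspace_dir="engine-ws")                   # compute statistics
engine.encode(workspace_dir="engine-ws")                    # encode for training
engine.train(max_training_time=1, workspace_dir="engine-ws")  # train briefly
engine.generate(sample_size=100, workspace_dir="engine-ws")   # sample synthetic data
```

All steps read from and write to the same workspace_dir, so it must stay consistent across calls.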

Schema Reference

mostlyai.engine.domain.DifferentialPrivacyConfig

The differential privacy configuration for training the model. If not provided, no differential privacy will be applied.

Parameters:

  • max_epsilon (float | None, default None): Specifies the maximum allowable epsilon value. If the training process exceeds this threshold, it will be terminated early. Only model checkpoints with epsilon values below this limit will be retained. If not provided, training will proceed without early termination based on epsilon constraints.
  • noise_multiplier (float, default 1.5): The ratio of the standard deviation of the Gaussian noise to the L2-sensitivity of the function to which the noise is added (i.e., how much noise to add).
  • max_grad_norm (float, default 1.0): The maximum norm of the per-sample gradients for training the model with differential privacy.
  • delta (float, default 1e-05): The delta value for differential privacy: the probability of the privacy guarantee not holding. The smaller the delta, the more confident you can be that the privacy guarantee holds.
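As a plain dict, which train also accepts for its differential_privacy argument, the configuration might look like this (max_epsilon and the training-set size are illustrative; the other values are the documented defaults). A common rule of thumb in DP practice is to choose delta below 1/n for n training rows:

```python
n_rows = 50_000  # illustrative size of the training set

dp_config = {
    "max_epsilon": 10.0,     # illustrative budget; None disables epsilon-based early termination
    "noise_multiplier": 1.5,
    "max_grad_norm": 1.0,
    "delta": 1e-5,
}

# Rule-of-thumb sanity check: delta below 1/n
assert dp_config["delta"] < 1 / n_rows
```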

mostlyai.engine.domain.FairnessConfig

Configure a fairness objective for the table.

The generated synthetic data will maintain robust statistical parity between the target column and the specified sensitive columns. All these columns must be categorical.

Parameters:

  • target_column (str, required)
  • sensitive_columns (list[str], required)

mostlyai.engine.domain.ImputationConfig

Configure imputation. For imputed columns, the sampling of NULL values is suppressed.

Parameters:

  • columns (list[str], required): The names of the columns to be imputed.

mostlyai.engine.domain.ModelEncodingType

The encoding type used for model training and data generation.

  • AUTO: Model chooses among available encoding types based on the column's data type.
  • TABULAR_CATEGORICAL: Model samples from existing (non-rare) categories.
  • TABULAR_NUMERIC_AUTO: Model chooses among 3 numeric encoding types based on the values.
  • TABULAR_NUMERIC_DISCRETE: Model samples from existing discrete numerical values.
  • TABULAR_NUMERIC_BINNED: Model first samples a binned bucket, then samples randomly within that bucket.
  • TABULAR_NUMERIC_DIGIT: Model samples each digit of a numerical value.
  • TABULAR_CHARACTER: Model samples each character of a string value.
  • TABULAR_DATETIME: Model samples each part of a datetime value.
  • TABULAR_DATETIME_RELATIVE: Model samples the relative difference between datetimes within a sequence.
  • TABULAR_LAT_LONG: Model samples a latitude-longitude column. The format is "latitude,longitude".
  • LANGUAGE_TEXT: Model trains a distinct LANGUAGE model for this column, which then generates free text.
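The signatures of split above accept these encoding types either as ModelEncodingType members or as plain strings. A hypothetical tgt_encoding_types mapping (column names are illustrative; the string spellings assume the accepted strings match the member names shown above):

```python
# Illustrative mapping for split(tgt_encoding_types=...)
tgt_encoding_types = {
    "age": "TABULAR_NUMERIC_AUTO",      # model picks a numeric encoding
    "zip_code": "TABULAR_CATEGORICAL",  # sample from existing categories
    "signup_ts": "TABULAR_DATETIME",    # sample each part of the datetime
    "review": "LANGUAGE_TEXT",          # handled by a separate LANGUAGE model
}
```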

mostlyai.engine.domain.ModelStateStrategy

The strategy for handling any existing model states and training progress.

  • RESET: Start training from scratch. Overwrite any existing model states and training progress.
  • REUSE: Reuse any existing model states, but start progress from scratch. Used for fine-tuning existing models.
  • RESUME: Reuse any existing model states and progress. Used for continuing an aborted training.

mostlyai.engine.domain.ModelType

The type of model.

  • TABULAR: A generative AI model tailored towards tabular data, trained from scratch.
  • LANGUAGE: A generative AI model built upon a (pre-trained) language model.

mostlyai.engine.domain.RareCategoryReplacementMethod

Specifies how rare categories will be sampled. Only applicable if value protection has been enabled.

  • CONSTANT: Replace rare categories by a constant _RARE_ token.
  • SAMPLE: Replace rare categories by a sample from non-rare categories.

mostlyai.engine.domain.RebalancingConfig

Configure rebalancing.

Parameters:

  • column (str, required): The name of the column to be rebalanced.
  • probabilities (dict[str, float], required): The target distribution of sample values. The keys are the categorical values, and the values are the probabilities.
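Since generate accepts these configs as plain dicts as well, a hypothetical combined setup might look like this (all column and category names are illustrative):

```python
# Hypothetical configs, passable to generate() as the rebalancing,
# imputation, and fairness arguments respectively
rebalancing = {
    "column": "gender",
    "probabilities": {"female": 0.5, "male": 0.5},
}
imputation = {"columns": ["age", "income"]}
fairness = {
    "target_column": "income_bracket",  # must be categorical
    "sensitive_columns": ["gender"],    # must be categorical
}

# Sanity check on this illustrative target distribution
assert sum(rebalancing["probabilities"].values()) == 1.0
```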