API Reference
Engine Reference
mostlyai.engine.split
split(
    tgt_data,
    *,
    ctx_data=None,
    tgt_primary_key=None,
    ctx_primary_key=None,
    tgt_context_key=None,
    model_type=None,
    tgt_encoding_types=None,
    ctx_encoding_types=None,
    n_partitions=1,
    workspace_dir="engine-ws",
    update_progress=None
)
Splits the provided original data into training and validation sets, and stores these as partitioned Parquet files.
This is a simplified version of mostlyai-data, tailored towards single- and two-table use cases, while requiring all data to be passed as DataFrames in memory.
Creates the following folder structure within the workspace_dir:
OriginalData/tgt-data: Partitioned target data files.
OriginalData/tgt-meta: Metadata files for target data.
OriginalData/ctx-data: Partitioned context data files (if context is provided).
OriginalData/ctx-meta: Metadata files for context data (if context is provided).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tgt_data | DataFrame | DataFrame containing the target data. | required |
ctx_data | DataFrame \| None | DataFrame containing the context data. | None |
tgt_primary_key | str \| None | Primary key column name in the target data. | None |
ctx_primary_key | str \| None | Primary key column name in the context data. | None |
tgt_context_key | str \| None | Context key column name in the target data. | None |
model_type | str \| ModelType \| None | Model type for the target data. If not provided, it will be inferred from the encoding types, or set to TABULAR by default. | None |
tgt_encoding_types | dict[str, str \| ModelEncodingType] \| None | Encoding types for columns in the target data (excluding key columns). | None |
ctx_encoding_types | dict[str, str \| ModelEncodingType] \| None | Encoding types for columns in the context data (excluding key columns). | None |
n_partitions | int | Number of partitions to split the data into. | 1 |
workspace_dir | str \| Path | Path to the workspace directory where files will be created. | 'engine-ws' |
update_progress | ProgressCallback \| None | A custom progress callback. | None |
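A minimal sketch of a two-table call, assuming the package is importable as `from mostlyai import engine` and that pandas is installed. The table contents and the column names (`id`, `customer_id`, `amount`, `country`) are hypothetical. The helper `expected_split_dirs` is not part of the engine; it only lists the folders documented above so that the result can be checked.

```python
from pathlib import Path


def expected_split_dirs(workspace_dir="engine-ws", with_context=False):
    """Folders that split is documented to create under workspace_dir."""
    base = Path(workspace_dir) / "OriginalData"
    names = ["tgt-data", "tgt-meta"]
    if with_context:
        names += ["ctx-data", "ctx-meta"]
    return [base / name for name in names]


def split_two_tables(workspace_dir="engine-ws"):
    """Two-table split; needs pandas and mostlyai-engine to be installed."""
    import pandas as pd  # lazy imports keep the helper above stdlib-only
    from mostlyai import engine

    # Hypothetical toy data: one context row per customer, many target rows.
    customers = pd.DataFrame({"id": [1, 2], "country": ["AT", "DE"]})
    orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [9.5, 20.0, 3.25]})

    engine.split(
        tgt_data=orders,
        ctx_data=customers,
        ctx_primary_key="id",           # key column of the context table
        tgt_context_key="customer_id",  # column in the target that refers to it
        workspace_dir=workspace_dir,
    )
    # Report which of the documented folders now exist.
    dirs = expected_split_dirs(workspace_dir, with_context=True)
    return [d for d in dirs if d.exists()]
```

Calling `split_two_tables("my-ws")` should return the four `OriginalData` folders if the split succeeded.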
mostlyai.engine.analyze
Generates (privacy-safe) column-level statistics of the original data that has been split into the workspace.
This information is required for encoding the original data as well as for decoding the generated data.
Creates the following folder structure within the workspace_dir:
ModelStore/tgt-stats/stats.json: Column-level statistics for target data.
ModelStore/ctx-stats/stats.json: Column-level statistics for context data (if context is provided).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
value_protection | bool | Whether to enable value protection for rare values. | True |
workspace_dir | str \| Path | Path to workspace directory containing partitioned data. | 'engine-ws' |
update_progress | ProgressCallback \| None | Optional callback to update progress during analysis. | None |
mostlyai.engine.encode
Encodes data in the workspace that has already been split and analyzed.
Creates the following folder structure within the workspace_dir:
OriginalData/encoded-data: Encoded data for training, stored as Parquet files.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
workspace_dir | str \| Path | Directory path for workspace. | 'engine-ws' |
update_progress | ProgressCallback \| None | Callback for progress updates. | None |
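Because encode expects a workspace that has already been split and analyzed, it can help to check for the documented outputs first. The sketch below assumes the import path `from mostlyai import engine`; the helper names `missing_prerequisites` and `prepare_for_training` are hypothetical and not part of the engine.

```python
from pathlib import Path


def missing_prerequisites(workspace_dir="engine-ws"):
    """Return the documented split/analyze outputs that are absent."""
    ws = Path(workspace_dir)
    required = [
        ws / "OriginalData" / "tgt-data",                # written by split
        ws / "ModelStore" / "tgt-stats" / "stats.json",  # written by analyze
    ]
    return [path for path in required if not path.exists()]


def prepare_for_training(workspace_dir="engine-ws"):
    """Run analyze, then encode, on a workspace that split has already filled."""
    from mostlyai import engine  # lazy import: needs mostlyai-engine installed

    engine.analyze(workspace_dir=workspace_dir)
    missing = missing_prerequisites(workspace_dir)
    if missing:
        raise FileNotFoundError(f"workspace not ready for encode: {missing}")
    engine.encode(workspace_dir=workspace_dir)
```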
mostlyai.engine.train
train(
    *,
    model=None,
    max_training_time=14400.0,
    max_epochs=100.0,
    batch_size=None,
    gradient_accumulation_steps=None,
    enable_flexible_generation=True,
    max_sequence_window=None,
    differential_privacy=None,
    model_state_strategy=ModelStateStrategy.reset,
    device=None,
    workspace_dir="engine-ws",
    update_progress=None,
    upload_model_data_callback=None
)
Trains a model with optional early stopping and differential privacy.
Creates the following folder structure within the workspace_dir:
ModelStore: Trained model checkpoints and logs.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model | str \| None | The identifier of the model to train. If tabular, defaults to MOSTLY_AI/Medium. If language, defaults to MOSTLY_AI/LSTMFromScratch-3m. | None |
max_training_time | float | Maximum training time in minutes. | 14400.0 |
max_epochs | float | Maximum number of training epochs. | 100.0 |
batch_size | int \| None | Per-device batch size for training and validation. If None, determined automatically. | None |
gradient_accumulation_steps | int \| None | Number of steps to accumulate gradients. If None, determined automatically. | None |
enable_flexible_generation | bool | Whether to enable flexible order generation. Defaults to True. | True |
max_sequence_window | int \| None | Maximum sequence window for tabular sequential models. Only applicable for tabular models. | None |
differential_privacy | DifferentialPrivacyConfig \| dict \| None | Configuration for differential privacy training. If None, DP is disabled. | None |
model_state_strategy | ModelStateStrategy | Strategy for handling existing model state (reset/resume/reuse). | reset |
device | device \| str \| None | Device to run training on ('cuda' or 'cpu'). Defaults to 'cuda' if available, else 'cpu'. | None |
workspace_dir | str \| Path | Directory path for workspace. Training outputs are stored in ModelStore subdirectory. | 'engine-ws' |
update_progress | ProgressCallback \| None | Callback function to report training progress. | None |
upload_model_data_callback | Callable \| None | Callback function to upload model data during training. | None |
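A short sketch of how the training arguments might be assembled, assuming the import path `from mostlyai import engine`. All numeric values are illustrative rather than recommended settings, and `build_train_kwargs` and `run_training` are hypothetical helper names. The differential privacy settings are passed as a plain dict, which the documented type `DifferentialPrivacyConfig | dict | None` permits.

```python
def build_train_kwargs(workspace_dir="engine-ws", private=False):
    """Collect keyword arguments for train; every value here is illustrative."""
    kwargs = {
        "max_training_time": 30.0,  # minutes, per the parameter table
        "max_epochs": 20.0,
        "enable_flexible_generation": True,
        "workspace_dir": workspace_dir,
    }
    if private:
        # Keys mirror the fields of DifferentialPrivacyConfig.
        kwargs["differential_privacy"] = {
            "max_epsilon": 5.0,       # stop early once this budget is exceeded
            "noise_multiplier": 1.5,  # documented default
            "max_grad_norm": 1.0,     # documented default
            "delta": 1e-5,            # documented default
        }
    return kwargs


def run_training(workspace_dir="engine-ws", private=False):
    """Train on an already encoded workspace; needs mostlyai-engine installed."""
    from mostlyai import engine  # lazy import

    engine.train(**build_train_kwargs(workspace_dir, private))
```

Leaving `differential_privacy` out of the call keeps it at its default of None, which disables DP.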
mostlyai.engine.generate
generate(
    *,
    ctx_data=None,
    seed_data=None,
    sample_size=None,
    batch_size=None,
    sampling_temperature=1.0,
    sampling_top_p=1.0,
    device=None,
    rare_category_replacement_method=None,
    rebalancing=None,
    imputation=None,
    fairness=None,
    workspace_dir="engine-ws",
    update_progress=None
)
Generates synthetic data from a trained model.
Creates the following folder structure within the workspace_dir:
SyntheticData: Generated synthetic data, stored as Parquet files.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ctx_data | DataFrame \| None | Context data to be used for generation. | None |
seed_data | DataFrame \| None | Seed data to condition generation on fixed target columns. | None |
sample_size | int \| None | Number of samples to generate. Defaults to number of original samples. | None |
batch_size | int \| None | Batch size for generation. If None, determined automatically. | None |
sampling_temperature | float | Sampling temperature. Higher values increase randomness. | 1.0 |
sampling_top_p | float | Nucleus sampling probability threshold. | 1.0 |
device | str \| None | Device to run generation on ('cuda' or 'cpu'). Defaults to 'cuda' if available, else 'cpu'. | None |
rare_category_replacement_method | RareCategoryReplacementMethod \| str \| None | Method for handling rare categories. Only applicable for tabular models. | None |
rebalancing | RebalancingConfig \| dict \| None | Configuration for rebalancing column distributions. Only applicable for tabular models. | None |
imputation | ImputationConfig \| dict \| None | List of columns to impute missing values. Only applicable for tabular models. | None |
fairness | FairnessConfig \| dict \| None | Configuration for fairness constraints. Only applicable for tabular models. | None |
workspace_dir | str \| Path | Directory path for workspace. | 'engine-ws' |
update_progress | ProgressCallback \| None | Callback for progress updates. | None |
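To make `sampling_temperature` and `sampling_top_p` more concrete, here is a pure-Python illustration of the general techniques they are named after (temperature scaling and nucleus sampling). It is not taken from the engine's source, and the function names are hypothetical; the engine's internals may differ in detail.

```python
import math


def apply_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature; higher values flatten the result."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def nucleus_filter(probs, top_p=1.0):
    """Keep the likeliest outcomes until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


# With top_p=0.7 the least likely outcome is dropped before sampling.
print(nucleus_filter([0.5, 0.3, 0.2], top_p=0.7))
```

With the defaults of 1.0 for both parameters, neither step changes the distribution.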
Schema Reference
mostlyai.engine.domain.DifferentialPrivacyConfig
The differential privacy configuration for training the model. If not provided, then no differential privacy will be applied.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
max_epsilon | float \| None | Specifies the maximum allowable epsilon value. If the training process exceeds this threshold, it will be terminated early. Only model checkpoints with epsilon values below this limit will be retained. If not provided, the training will proceed without early termination based on epsilon constraints. | None |
noise_multiplier | float | The ratio of the standard deviation of the Gaussian noise to the L2-sensitivity of the function to which the noise is added (how much noise to add). | 1.5 |
max_grad_norm | float | The maximum norm of the per-sample gradients for training the model with differential privacy. | 1.0 |
delta | float | The delta value for differential privacy. It is the probability of the privacy guarantee not holding. The smaller the delta, the more confident you can be that the privacy guarantee holds. | 1e-05 |
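The roles of `max_grad_norm` and `noise_multiplier` can be illustrated with a textbook DP-SGD aggregation step: each per-sample gradient is clipped to `max_grad_norm`, the clipped gradients are summed, and Gaussian noise with standard deviation `noise_multiplier * max_grad_norm` is added. This is a sketch of the general technique under that assumption, written with plain lists instead of tensors; it is not the engine's training loop, and the function names are hypothetical.

```python
import math
import random


def clip_gradient(grad, max_grad_norm=1.0):
    """Scale grad down so that its L2 norm is at most max_grad_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_grad_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def noisy_gradient_sum(per_sample_grads, noise_multiplier=1.5,
                       max_grad_norm=1.0, rng=None):
    """Sum clipped per-sample gradients and add Gaussian noise to each coordinate."""
    rng = rng or random.Random()
    clipped = [clip_gradient(g, max_grad_norm) for g in per_sample_grads]
    summed = [sum(column) for column in zip(*clipped)]
    # After clipping, one sample can change the sum by at most max_grad_norm,
    # so the noise scale is the multiplier times that sensitivity.
    sigma = noise_multiplier * max_grad_norm
    return [s + rng.gauss(0.0, sigma) for s in summed]
```

A larger `noise_multiplier` adds more noise per step, which slows the growth of epsilon at the cost of noisier updates.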
mostlyai.engine.domain.FairnessConfig
Configure a fairness objective for the table.
The generated synthetic data will maintain robust statistical parity between the target column and the specified sensitive columns. All these columns must be categorical.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
target_column | str | | required |
sensitive_columns | list[str] | | required |
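One way to check the effect of a fairness objective is to measure statistical parity on the generated rows: the share of a given target value should be similar across the groups of a sensitive column. The sketch below uses the standard definition on a list of dicts; the column names `income` and `gender` and the helper names are hypothetical, and this is an evaluation aid, not the mechanism the engine uses.

```python
from collections import defaultdict

# Hypothetical configuration, using the two documented fields.
fairness = {"target_column": "income", "sensitive_columns": ["gender"]}


def positive_rates(rows, target_column, sensitive_column, positive_value):
    """Share of rows with target == positive_value, per sensitive group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for row in rows:
        group = row[sensitive_column]
        counts[group][1] += 1
        if row[target_column] == positive_value:
            counts[group][0] += 1
    return {group: pos / total for group, (pos, total) in counts.items()}


def parity_gap(rates):
    """Difference between the highest and lowest group rate (0 means parity)."""
    return max(rates.values()) - min(rates.values())


rows = [
    {"gender": "a", "income": "high"},
    {"gender": "a", "income": "low"},
    {"gender": "b", "income": "high"},
    {"gender": "b", "income": "high"},
]
rates = positive_rates(rows, fairness["target_column"], "gender", "high")
print(rates, parity_gap(rates))
```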
mostlyai.engine.domain.ImputationConfig
Configure imputation. Imputed columns will suppress the sampling of NULL values.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
columns | list[str] | The names of the columns to be imputed. | required |
mostlyai.engine.domain.ModelEncodingType
The encoding type used for model training and data generation.
AUTO: Model chooses among available encoding types based on the column's data type.
TABULAR_CATEGORICAL: Model samples from existing (non-rare) categories.
TABULAR_NUMERIC_AUTO: Model chooses among 3 numeric encoding types based on the values.
TABULAR_NUMERIC_DISCRETE: Model samples from existing discrete numerical values.
TABULAR_NUMERIC_BINNED: Model samples from binned buckets, to then sample randomly within a bucket.
TABULAR_NUMERIC_DIGIT: Model samples each digit of a numerical value.
TABULAR_CHARACTER: Model samples each character of a string value.
TABULAR_DATETIME: Model samples each part of a datetime value.
TABULAR_DATETIME_RELATIVE: Model samples the relative difference between datetimes within a sequence.
TABULAR_LAT_LONG: Model samples a latitude-longitude column. The format is "latitude,longitude".
LANGUAGE_TEXT: Model will train a distinct LANGUAGE model for this column, to then generate free text.
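The encoding types above are what `tgt_encoding_types` and `ctx_encoding_types` map column names to. The sketch below assumes the string form of each type matches the name listed here; the column names and the helper `check_encoding_types` are hypothetical. It flags unknown type names and key columns, which the split documentation says should be excluded from these mappings.

```python
DOCUMENTED_ENCODING_TYPES = {
    "AUTO",
    "TABULAR_CATEGORICAL",
    "TABULAR_NUMERIC_AUTO",
    "TABULAR_NUMERIC_DISCRETE",
    "TABULAR_NUMERIC_BINNED",
    "TABULAR_NUMERIC_DIGIT",
    "TABULAR_CHARACTER",
    "TABULAR_DATETIME",
    "TABULAR_DATETIME_RELATIVE",
    "TABULAR_LAT_LONG",
    "LANGUAGE_TEXT",
}


def check_encoding_types(encoding_types, key_columns=()):
    """Return a list of problems found in a column -> encoding type mapping."""
    problems = []
    for column, encoding in encoding_types.items():
        if column in key_columns:
            problems.append(f"{column}: key columns must be excluded")
        elif encoding not in DOCUMENTED_ENCODING_TYPES:
            problems.append(f"{column}: unknown encoding type {encoding!r}")
    return problems


# Hypothetical target table with a context key column named customer_id.
tgt_encoding_types = {
    "amount": "TABULAR_NUMERIC_AUTO",
    "ordered_at": "TABULAR_DATETIME",
    "shop_location": "TABULAR_LAT_LONG",  # values such as "48.2082,16.3738"
}
print(check_encoding_types(tgt_encoding_types, key_columns=["customer_id"]))
```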
mostlyai.engine.domain.ModelStateStrategy
The strategy of how any existing model states and training progress are to be handled.
RESET: Start training from scratch. Overwrite any existing model states and training progress.
REUSE: Reuse any existing model states, but start progress from scratch. Used for fine-tuning existing models.
RESUME: Reuse any existing model states and progress. Used for continuing an aborted training.
mostlyai.engine.domain.ModelType
The type of model.
TABULAR: A generative AI model tailored towards tabular data, trained from scratch.
LANGUAGE: A generative AI model built upon a (pre-trained) language model.
mostlyai.engine.domain.RareCategoryReplacementMethod
Specifies how rare categories will be sampled. Only applicable if value protection has been enabled.
CONSTANT: Replace rare categories by a constant _RARE_ token.
SAMPLE: Replace rare categories by a sample from non-rare categories.
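The difference between the two methods can be shown on a plain list of values. This is an illustration only: which categories count as rare is decided by the engine's value protection and is simply passed in here, drawing replacements in proportion to the observed non-rare values is an assumption, and `replace_rare` is a hypothetical helper.

```python
import random

RARE_TOKEN = "_RARE_"


def replace_rare(values, rare_categories, method="CONSTANT", rng=None):
    """Replace rare values by a constant token or by a sampled non-rare value."""
    rng = rng or random.Random()
    # Pool of observed non-rare values; must be non-empty for method="SAMPLE".
    non_rare = [v for v in values if v not in rare_categories]
    result = []
    for value in values:
        if value not in rare_categories:
            result.append(value)
        elif method == "CONSTANT":
            result.append(RARE_TOKEN)
        elif method == "SAMPLE":
            result.append(rng.choice(non_rare))
        else:
            raise ValueError(f"unknown method: {method!r}")
    return result


values = ["a", "a", "b", "z"]
print(replace_rare(values, {"z"}, method="CONSTANT"))
print(replace_rare(values, {"z"}, method="SAMPLE"))
```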
mostlyai.engine.domain.RebalancingConfig
Configure rebalancing.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column | str | The name of the column to be rebalanced. | required |
probabilities | dict[str, float] | The target distribution of sample values. The keys are the categorical values, and the values are the probabilities. | required |
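A rebalancing configuration can be written as a plain dict and sanity-checked before it is passed to generate. The check below assumes that each probability lies between 0 and 1 and that the probabilities sum to at most 1; these constraints are assumptions and are not stated in this reference. The column name `gender` and the helper `check_rebalancing` are hypothetical.

```python
import math


def check_rebalancing(config):
    """Return a list of problems found in a rebalancing config dict."""
    problems = []
    if not config.get("column"):
        problems.append("column is required")
    probabilities = config.get("probabilities") or {}
    if not probabilities:
        problems.append("probabilities is required")
    for category, p in probabilities.items():
        if not 0.0 <= p <= 1.0:
            problems.append(f"{category}: probability {p} is outside [0, 1]")
    total = sum(probabilities.values())
    # Assumed constraint: the requested shares cannot exceed the whole.
    if total > 1.0 and not math.isclose(total, 1.0):
        problems.append(f"probabilities sum to {total}, which exceeds 1")
    return problems


rebalancing = {"column": "gender", "probabilities": {"female": 0.5}}
print(check_rebalancing(rebalancing))
```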