Schema References for mostlyai.domain

This module is auto-generated and represents the pydantic-based classes of the schema defined in the Public API.

mostlyai.domain

AboutService

General information about the service.

Parameters:

Name Type Description Default
version str | None

The version number of the service.

None
assistant bool | None

A flag indicating if the assistant is enabled.

None

Accuracy

Metrics regarding the accuracy of synthetic data, measured as the closeness of discretized lower dimensional marginal distributions.

  1. Univariate Accuracy: The accuracy of the univariate distributions for all target columns.
  2. Bivariate Accuracy: The accuracy of all pair-wise distributions for target columns, as well as for target columns with respect to the context columns.
  3. Coherence Accuracy: The accuracy of the auto-correlation for all target columns.

Accuracy is defined as 100% minus the Total Variation Distance (TVD), where TVD is half the sum of the absolute differences of the relative frequencies of the corresponding distributions.

These accuracies are calculated for all discretized univariate and bivariate distributions and, in the case of sequential data, also for all coherence distributions. Overall metrics are then calculated as the average across these accuracies.

All metrics can be compared against a theoretical maximum accuracy, which is calculated for a same-sized holdout. The accuracy metrics should be as close as possible to the theoretical maximum, but not significantly higher, as that would indicate overfitting.
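The TVD-based definition above can be sketched in a few lines of Python; `univariate_accuracy` is a hypothetical helper operating on two samples of a discretized column, not the library's internal implementation:

```python
from collections import Counter

def univariate_accuracy(real, synthetic):
    """Return accuracy in [0, 1] as 1 - TVD between two discretized samples."""
    freq_real = {k: v / len(real) for k, v in Counter(real).items()}
    freq_syn = {k: v / len(synthetic) for k, v in Counter(synthetic).items()}
    categories = set(freq_real) | set(freq_syn)
    # TVD is half the sum of absolute differences of relative frequencies.
    tvd = 0.5 * sum(abs(freq_real.get(c, 0.0) - freq_syn.get(c, 0.0)) for c in categories)
    return 1.0 - tvd
```

Identical distributions yield an accuracy of 1.0, fully disjoint ones 0.0.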

Parameters:

Name Type Description Default
overall float | None

Overall accuracy of synthetic data, averaged across univariate, bivariate, and coherence.

None
univariate float | None

Average accuracy of discretized univariate distributions.

None
bivariate float | None

Average accuracy of discretized bivariate distributions.

None
coherence float | None

Average accuracy of discretized coherence distributions. Only applicable for sequential data.

None
overall_max float | None

Expected overall accuracy of a same-sized holdout. Serves as a reference for overall.

None
univariate_max float | None

Expected univariate accuracy of a same-sized holdout. Serves as a reference for univariate.

None
bivariate_max float | None

Expected bivariate accuracy of a same-sized holdout. Serves as a reference for bivariate.

None
coherence_max float | None

Expected coherence accuracy of a same-sized holdout. Serves as a reference for coherence.

None

BaseResource

A set of common properties across resources.

Parameters:

Name Type Description Default
id str | None

The unique identifier of the entity.

None
name str | None

The name of the entity.

None
uri str | None

The API service endpoint of the entity

None
current_user_permission_level PermissionLevel | None
None
current_user_like_status bool | None

A boolean indicating whether the user has liked the entity or not

None

Compute

A compute resource for executing tasks.

Parameters:

Name Type Description Default
id str | None
None
name str | None
None
type ComputeType | None
None
config dict[str, Any] | None
None
secrets dict[str, Any] | None
None
resources ComputeResources | None
None
order_index int | None

The index for determining the sort order when listing computes

None

ComputeConfig

The configuration for creating a new compute resource.

Parameters:

Name Type Description Default
name str | None
None
type ComputeType | None
None
resources ComputeResources | None
None
config dict[str, Any] | None
None
secrets dict[str, Any] | None
None
order_index int | None

The index for determining the sort order when listing computes

None

ComputeListItem

Essential compute details for listings.

Parameters:

Name Type Description Default
id str | None
None
type ComputeType | None
None
name str | None
None
resources ComputeResources | None
None

ComputeResources

A set of available hardware resources for a compute resource.

Parameters:

Name Type Description Default
cpus int | None

The number of CPU cores

None
memory float | None

The amount of memory in GB

None
gpus int | None

The number of GPUs

0
gpu_memory float | None

The amount of GPU memory in GB

0

ComputeType

The type of compute.

Connector

A connector is a connection to a data source or a data destination.

Parameters:

Name Type Description Default
id str

The unique identifier of a connector.

required
name str

The name of a connector.

required
type ConnectorType
required
access_type ConnectorAccessType
required
config dict[str, Any] | None
None
secrets dict[str, str] | None
None
ssl dict[str, str] | None
None
metadata Metadata | None
None
usage ConnectorUsage | None
None
table_id str | None

Optional ID of the source table or synthetic table that this connector belongs to. If not set, then this connector is managed independently of any generator or synthetic dataset.

None
delete
delete()

Delete the connector.

Returns:

Type Description
None

None

locations
locations(prefix='')

List connector locations.

List the available databases, schemas, tables, or folders for a connector. For storage connectors, this returns the list of folders and files at the root level, or at the prefix level if a prefix is provided. For DB connectors, this returns the list of schemas (or databases for DBs without schemas), or the list of tables if a prefix is provided.

The formats of the locations are:

  • Cloud storage:
    • AZURE_STORAGE: container/path
    • GOOGLE_CLOUD_STORAGE: bucket/path
    • S3_STORAGE: bucket/path
  • Database:
    • BIGQUERY: dataset.table
    • DATABRICKS: schema.table
    • HIVE: database.table
    • MARIADB: database.table
    • MSSQL: schema.table
    • MYSQL: database.table
    • ORACLE: schema.table
    • POSTGRES: schema.table
    • SNOWFLAKE: schema.table

Parameters:

Name Type Description Default
prefix str

The prefix to filter the results by.

''

Returns:

Name Type Description
list list

A list of locations (schemas, databases, directories, etc.).

schema
schema(location)

Retrieve the schema of the table at a connector location. Please refer to locations() for the format of the location.

Parameters:

Name Type Description Default
location str

The location of the table.

required

Returns:

Type Description
list[dict[str, Any]]

list[dict[str, Any]]: The retrieved schema.

update
update(
    name=None,
    config=None,
    secrets=None,
    ssl=None,
    test_connection=True,
)

Update a connector with specific parameters.

Parameters:

Name Type Description Default
name str | None

The name of the connector.

None
config dict[str, Any]

Connector configuration.

None
secrets dict[str, str]

Secret values for the connector.

None
ssl dict[str, str]

SSL configuration for the connector.

None
test_connection bool | None

If true, validates the connection before saving.

True

ConnectorAccessType

The access type of a connector. Either SOURCE or DESTINATION.

ConnectorConfig

The structures of the config, secrets and ssl parameters depend on the connector type.

  • Cloud storage:
    - type: AZURE_STORAGE
      config:
        accountName: string
        clientId: string (required for auth via service principal)
        tenantId: string (required for auth via service principal)
      secrets:
        accountKey: string (required for regular auth)
        clientSecret: string (required for auth via service principal)
    
    - type: GOOGLE_CLOUD_STORAGE
      config:
      secrets:
        keyFile: string
    
    - type: S3_STORAGE
      config:
        accessKey: string
        endpointUrl: string (only needed for S3-compatible storage services other than AWS)
        sslEnabled: boolean, default: false
      secrets:
        secretKey: string
      ssl:
        caCertificate: base64-encoded string
    
  • Database:
    - type: BIGQUERY
      config:
      secrets:
        keyFile: string
    
    - type: DATABRICKS
      config:
        host: string
        httpPath: string
        catalog: string
        clientId: string (required for auth via service principal)
        tenantId: string (required for auth via service principal)
      secrets:
        accessToken: string (required for regular auth)
        clientSecret: string (required for auth via service principal)
    
    - type: HIVE
      config:
        host: string
        port: integer, default: 10000
        username: string (required for regular auth)
        kerberosEnabled: boolean, default: false
        kerberosServicePrincipal: string (required if kerberosEnabled)
        kerberosClientPrincipal: string (optional if kerberosEnabled)
        kerberosKrb5Conf: string (required if kerberosEnabled)
        sslEnabled: boolean, default: false
      secrets:
        password: string (required for regular auth)
        kerberosKeytab: base64-encoded string (required if kerberosEnabled)
      ssl:
        caCertificate: base64-encoded string
    
    - type: MARIADB
      config:
        host: string
        port: integer, default: 3306
        username: string
      secrets:
        password: string
    
    - type: MSSQL
      config:
        host: string
        port: integer, default: 1433
        username: string
        database: string
      secrets:
       password: string
    
    - type: MYSQL
      config:
        host: string
        port: integer, default: 3306
        username: string
      secrets:
        password: string
    
    - type: ORACLE
      config:
        host: string
        port: integer, default: 1521
        username: string
        connectionType: enum {SID, SERVICE_NAME}, default: SID
        database: string, default: ORCL
      secrets:
        password: string
    
    - type: POSTGRES
      config:
        host: string
        port: integer, default: 5432
        username: string
        database: string
        sslEnabled: boolean, default: false
      secrets:
        password: string
      ssl:
        rootCertificate: base64-encoded string
        sslCertificate: base64-encoded string
        sslCertificateKey: base64-encoded string
    
    - type: SNOWFLAKE
      config:
        account: string
        username: string
        warehouse: string, default: COMPUTE_WH
        database: string
      secrets:
        password: string
    

Parameters:

Name Type Description Default
name str | None

The name of a connector.

None
type ConnectorType
required
access_type ConnectorAccessType | None
<ConnectorAccessType.source: 'SOURCE'>
config dict[str, Any] | None
None
secrets dict[str, str] | None
None
ssl dict[str, str] | None
None
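As an illustration, a POSTGRES connector configuration following the `config`/`secrets` structure documented above might look like this (all values are placeholders):

```python
# Hypothetical POSTGRES connector configuration payload.
postgres_connector_config = {
    "name": "my-postgres-source",
    "type": "POSTGRES",
    "access_type": "SOURCE",
    "config": {
        "host": "db.example.com",
        "port": 5432,          # documented default for POSTGRES
        "username": "analyst",
        "database": "prod",
        "sslEnabled": False,   # documented default
    },
    "secrets": {"password": "***"},  # placeholder; never hard-code real secrets
}
```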

ConnectorListItem

Essential connector details for listings.

Parameters:

Name Type Description Default
id str

The unique identifier of a connector.

required
name str

The name of a connector.

required
type ConnectorType
required
access_type ConnectorAccessType
required
metadata Metadata
required
usage ConnectorUsage | None
None

ConnectorType

The type of a connector.

The type determines the structure of the config, secrets and ssl parameters.

  • MYSQL: MySQL database
  • POSTGRES: PostgreSQL database
  • MSSQL: Microsoft SQL Server database
  • ORACLE: Oracle database
  • MARIADB: MariaDB database
  • SNOWFLAKE: Snowflake cloud data platform
  • BIGQUERY: Google BigQuery cloud data warehouse
  • HIVE: Apache Hive database
  • DATABRICKS: Databricks cloud data platform
  • AZURE_STORAGE: Azure Blob Storage
  • GOOGLE_CLOUD_STORAGE: Google Cloud Storage
  • S3_STORAGE: Amazon S3 Storage
  • FILE_UPLOAD: File upload

ConnectorUsage

Usage statistics of a connector.

Parameters:

Name Type Description Default
no_of_shares int | None

Number of shares of this connector.

None
no_of_generators int | None

Number of generators using this connector.

None

Credits

Parameters:

Name Type Description Default
current float | None

The credit balance for the current time period

None
limit float | None

The credit limit for the current time period. If empty, then there is no limit.

None
period_start datetime | None

The UTC date and time when the current time period started

None
period_end datetime | None

The UTC date and time when the current time period ends

None

CurrentUser

Information on the current user.

Parameters:

Name Type Description Default
id str | None

The unique identifier of a user.

None
first_name str | None

First name of a user

None
last_name str | None

Last name of a user

None
email str | None

The email of a user

None
settings dict[str, Any] | None
None
usage UserUsage | None
None
unread_notifications int | None

Number of unread notifications for the user

None

DifferentialPrivacyConfig

The optional differential privacy configuration for training the model. If not provided, then no differential privacy will be applied.

Parameters:

Name Type Description Default
max_epsilon float | None

Specifies the maximum allowable epsilon value. If the training process exceeds this threshold, it will be terminated early. Only model checkpoints with epsilon values below this limit will be retained. If not provided, the training will proceed without early termination based on epsilon constraints.

None
noise_multiplier float | None

The ratio of the standard deviation of the Gaussian noise to the L2-sensitivity of the function to which the noise is added, i.e. how much noise to add.

1.5
max_grad_norm float | None

The maximum norm of the per-sample gradients for training the model with differential privacy.

1.0
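A minimal sketch of such a configuration as a plain dict, using the documented defaults and an assumed epsilon budget of 8.0 (the budget value is illustrative, not a recommendation):

```python
# Hypothetical DifferentialPrivacyConfig values.
dp_config = {
    "max_epsilon": 8.0,       # stop retaining checkpoints once epsilon exceeds this budget
    "noise_multiplier": 1.5,  # documented default
    "max_grad_norm": 1.0,     # documented default
}
```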

Distances

Metrics regarding the nearest neighbor distances between training, holdout, and synthetic samples in an embedding space. Useful for assessing the novelty / privacy of synthetic data.

The provided data is first down-sampled so that the number of samples matches across all datasets. Note that for optimal sensitivity of this privacy assessment it is recommended to use a 50/50 split between training and holdout data, and then generate synthetic data of the same size.

The embeddings of these samples are then computed, and the L2 nearest neighbor distances are calculated for each synthetic sample to the training and holdout samples. Based on these nearest neighbor distances, the following metrics are calculated:

  • Identical Match Share (IMS): The share of synthetic samples that are identical to a training or holdout sample.
  • Distance to Closest Record (DCR): The average distance of synthetic to training or holdout samples.

For privacy-safe synthetic data we expect to see about as many identical matches, and about the same distances for synthetic samples to training, as we see for synthetic samples to holdout.
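The IMS and DCR metrics can be sketched over plain embedding vectors; `dcr_and_ims` is a hypothetical helper, not the library's implementation:

```python
import math

def l2(a, b):
    """Euclidean (L2) distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dcr_and_ims(synthetic, reference):
    """DCR and IMS of synthetic samples against a reference set.

    `reference` is either the training or the holdout embeddings;
    running this twice gives the *_training and *_holdout metric pairs.
    """
    nn = [min(l2(s, r) for r in reference) for s in synthetic]
    dcr = sum(nn) / len(nn)                       # average nearest-neighbor distance
    ims = sum(1 for d in nn if d == 0.0) / len(nn)  # share of exact matches
    return dcr, ims
```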

Parameters:

Name Type Description Default
ims_training float | None

Share of synthetic samples that are identical to a training sample.

None
ims_holdout float | None

Share of synthetic samples that are identical to a holdout sample. Serves as a reference for ims_training.

None
dcr_training float | None

Average L2 nearest-neighbor distance between synthetic and training samples.

None
dcr_holdout float | None

Average L2 nearest-neighbor distance between synthetic and holdout samples. Serves as a reference for dcr_training.

None
dcr_share float | None

Share of synthetic samples that are closer to a training sample than to a holdout sample. This should not be significantly larger than 50%.

None

FairnessConfig

Configure a fairness objective for the table. Only applicable for a subject table. The generated synthetic data will maintain robust statistical parity between the target column and the specified sensitive columns. All these columns must be categorical.

Parameters:

Name Type Description Default
target_column str

The name of the target column.

required
sensitive_columns list[str]

The names of the sensitive columns.

required

ForeignKey

Parameters:

Name Type Description Default
column str

The column name of a foreign key.

required
referenced_table str

The table name of the referenced table. That table must have a primary key already defined.

required
is_context bool

If true, then the foreign key will be considered as a context relation. Note, that only one foreign key relation per table can be a context relation.

required
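A hypothetical foreign-key definition as a plain dict, assuming a `purchases` table that references a `customers` subject table:

```python
# Hypothetical ForeignKey values for a purchases -> customers relation.
purchases_fk = {
    "column": "customer_id",
    "referenced_table": "customers",  # must already have a primary key defined
    "is_context": True,               # only one context relation per table
}
```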

Generator

A generator is a set of models that can generate synthetic data.

The generator can be trained on one or more source tables. A quality assurance report is generated for each model.

Parameters:

Name Type Description Default
id str

The unique identifier of a generator.

required
name str | None

The name of a generator.

None
description str | None

The description of a generator.

None
training_status ProgressStatus
required
training_time datetime | None

The UTC date and time when the training has finished.

None
usage GeneratorUsage | None
None
metadata Metadata
required
accuracy float | None

The overall accuracy of the trained generator. This is the average of the overall accuracy scores of all trained models.

None
tables list[SourceTable] | None

The tables of this generator

None
training Any | None
None
Training
cancel
cancel()

Cancel training.

progress
progress()

Retrieve job progress of training.

Returns:

Name Type Description
JobProgress JobProgress

The job progress of the training process.

start
start()

Start training.

wait
wait(progress_bar=True, interval=2)

Poll training progress and loop until training has completed.

Parameters:

Name Type Description Default
progress_bar bool

If true, displays the progress bar.

True
interval float

The interval in seconds to poll the job progress.

2
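The behavior of wait() can be sketched as a simple polling loop over the documented terminal statuses; `wait_for` and `progress_fn` are illustrative stand-ins, not SDK API:

```python
import time

def wait_for(progress_fn, interval=2, max_polls=1000):
    """Poll a status-returning callable until a terminal status is reached.

    `progress_fn` stands in for a call like Generator.training.progress()
    that yields the current ProgressStatus as a string.
    """
    terminal = {"DONE", "FAILED", "CANCELED"}
    for _ in range(max_polls):
        status = progress_fn()
        if status in terminal:
            return status
        time.sleep(interval)  # poll at the given interval, as wait() does
    raise TimeoutError("job did not reach a terminal status")
```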
clone
clone(training_status='NEW')

Clone the generator.

Parameters:

Name Type Description Default
training_status Literal['NEW', 'CONTINUE']

The training status of the cloned generator.

'NEW'

Returns:

Name Type Description
Generator Generator

The cloned generator object.

config
config()

Retrieve writable generator properties.

Returns:

Name Type Description
GeneratorConfig GeneratorConfig

The generator properties as a configuration object.

delete
delete()

Delete the generator.

Returns:

Type Description
None

None

export_to_file
export_to_file(file_path=None)

Export generator and save to file.

Parameters:

Name Type Description Default
file_path str | Path | None

The file path to save the generator.

None

Returns:

Type Description
Path

The path to the saved file.

update
update(name=None, description=None)

Update a generator with specific parameters.

Parameters:

Name Type Description Default
name str | None

The name of the generator.

None
description str | None

The description of the generator.

None

GeneratorCloneTrainingStatus

The training status of the new generator. The available options are:

  • NEW: The new generator will re-use existing data and model configurations.
  • CONTINUE: The new generator will re-use existing data and model configurations, as well as model weights.

GeneratorConfig

The configuration for creating a new generator.

Parameters:

Name Type Description Default
name str | None

The name of a generator.

None
description str | None

The description of a generator.

None
tables list[SourceTableConfig] | None

The tables of a generator

None
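A hypothetical GeneratorConfig payload with a single table, as a plain dict. The column entries follow SourceColumnConfig; nesting them under a `columns` field of the table entry is an assumption, since SourceTableConfig's full field list is not reproduced here:

```python
# Hypothetical GeneratorConfig values; field names follow the
# documented pydantic attributes, values are placeholders.
generator_config = {
    "name": "census-generator",
    "description": "Demo generator for a single subject table.",
    "tables": [
        {
            "name": "census",
            "columns": [  # assumed field, mirroring SourceColumnConfig
                {"name": "age", "model_encoding_type": "TABULAR_NUMERIC_AUTO"},
                {"name": "income_class", "model_encoding_type": "TABULAR_CATEGORICAL"},
            ],
        }
    ],
}
```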

GeneratorImportFromFileConfig

Parameters:

Name Type Description Default
file bytes
required

GeneratorListItem

Essential generator details for listings.

Parameters:

Name Type Description Default
id str

The unique identifier of a generator.

required
name str | None

The name of a generator.

None
description str | None

The description of a generator.

None
training_status ProgressStatus
required
training_time datetime | None

The UTC date and time when the training has finished.

None
usage GeneratorUsage | None
None
metadata Metadata
required

GeneratorUsage

Usage statistics of a generator.

Parameters:

Name Type Description Default
total_datapoints int | None

The total number of datapoints generated by this generator.

None
total_compute_time int | None

The total compute time in seconds used for training this generator. This is the sum of the compute time of all trained tasks.

None
no_of_synthetic_datasets int | None

Number of synthetic datasets generated by this generator.

None
no_of_shares int | None

Number of shares of this generator.

None
no_of_likes int | None

Number of likes of this generator.

None

ImputationConfig

Configure imputation.

Parameters:

Name Type Description Default
columns list[str]

The names of the columns to be imputed. Imputed columns will suppress the sampling of NULL values.

required

JobProgress

The progress of a job.

Parameters:

Name Type Description Default
id str | None
None
start_date datetime | None

The UTC date and time when the job has started. If the job has not started yet, then this is None.

None
end_date datetime | None

The UTC date and time when the job has ended. If the job is still running, then this is None.

None
progress ProgressValue | None
None
status ProgressStatus | None
None
steps list[ProgressStep] | None
None

Metadata

The metadata of a resource.

Parameters:

Name Type Description Default
created_at datetime | None

The UTC date and time when the resource has been created.

None
owner_id str | None

The unique identifier of the owner of the entity.

None
owner_name str | None

The name of the owner of the entity.

None
current_user_permission_level PermissionLevel | None
None
current_user_like_status bool | None

A boolean indicating whether the user has liked the entity or not

None
short_lived_file_token str | None

An auto-generated short-lived file token (slft) for accessing resource artefacts. The token is always restricted to a single resource, is only valid for 60 minutes, and is only accepted by API endpoints that allow downloading single files.

None

ModelConfiguration

The training configuration for the model

Parameters:

Name Type Description Default
model str | None

The model to be used for training.

None
max_sample_size int | None

The maximum number of samples to consider for training. If not provided, then all available samples will be taken.

None
batch_size int | None

The batch size used for training the model. If not provided, the batch size will be chosen automatically.

None
max_training_time float | None

The maximum number of minutes to train the model.

10
max_epochs float | None

The maximum number of epochs to train the model.

100
max_sequence_window int | None

The maximum sequence window to consider for training. Only applicable for TABULAR models.

100
enable_flexible_generation bool | None

If true, then the trained generator can be used for rebalancing and imputation. Only applicable for TABULAR models.

True
value_protection bool | None

Defines if Rare Category, Extreme value, or Sequence length protection will be applied.

True
rare_category_replacement_method RareCategoryReplacementMethod | None

Specifies how rare categories will be sampled. Only applicable if value protection has been enabled.

  • CONSTANT: Replace rare categories by a constant _RARE_ token.
  • SAMPLE: Replace rare categories by a sample from non-rare categories.
<RareCategoryReplacementMethod.constant: 'CONSTANT'>
differential_privacy DifferentialPrivacyConfig | None
None
compute str | None
None

ModelEncodingType

The encoding type used for model training and data generation.

  • AUTO: Model chooses among available encoding types based on the column's data type.
  • TABULAR_CATEGORICAL: Model samples from existing (non-rare) categories.
  • TABULAR_NUMERIC_AUTO: Model chooses among 3 numeric encoding types based on the values.
  • TABULAR_NUMERIC_DISCRETE: Model samples from existing discrete numerical values.
  • TABULAR_NUMERIC_BINNED: Model samples from binned buckets, to then sample randomly within a bucket.
  • TABULAR_NUMERIC_DIGIT: Model samples each digit of a numerical value.
  • TABULAR_CHARACTER: Model samples each character of a string value.
  • TABULAR_DATETIME: Model samples each part of a datetime value.
  • TABULAR_DATETIME_RELATIVE: Model samples the relative difference between datetimes within a sequence.
  • TABULAR_LAT_LONG: Model samples a latitude-longitude column. The format is "latitude,longitude".
  • LANGUAGE_TEXT: Model will train a distinct LANGUAGE model for this column, to then generate free text.

Encoding types that are not prefixed with either TABULAR or LANGUAGE are deprecated.

ModelMetrics

Metrics regarding the quality of synthetic data, measured in terms of accuracy, similarity, and distances.

  1. Accuracy: Metrics regarding the accuracy of synthetic data, measured as the closeness of discretized lower dimensional marginal distributions.
  2. Similarity: Metrics regarding the similarity of the full joint distributions of samples within an embedding space.
  3. Distances: Metrics regarding the nearest neighbor distances between training, holdout, and synthetic samples in an embedding space. Useful for assessing the novelty / privacy of synthetic data.

The quality of synthetic data is assessed by comparing these metrics to the same metrics of a holdout dataset. The holdout dataset is a subset of the original training data, that was not used for training the synthetic data generator. The metrics of the synthetic data should be as close as possible to the metrics of the holdout data.

Parameters:

Name Type Description Default
accuracy Accuracy | None
None
distances Distances | None
None
similarity Similarity | None
None

ModelType

The type of model.

  • TABULAR: A generative AI model tailored towards tabular data, trained from scratch.
  • LANGUAGE: A generative AI model built upon a (pre-trained) language model.

Notification

A notification for a user.

Parameters:

Name Type Description Default
id str

The unique identifier of the notification.

required
type NotificationType
required
message str

The message of the notification.

required
status NotificationStatus
required
created_at datetime

The UTC date and time when the notification has been created.

required
resource_uri str | None

The API service endpoint of the entity

None

NotificationStatus

The status of the notification.

NotificationType

The type of the notification

PaginatedTotalCount

Parameters:

Name Type Description Default
root int

The total number of entities within the list

required

ParallelGenerationJobs

Parameters:

Name Type Description Default
current int | None

The number of currently running generation jobs.

None
limit int | None

The maximum number of running generation jobs at any time. If empty, then there is no limit.

None

ParallelTrainingJobs

Parameters:

Name Type Description Default
current int | None

The number of currently running training jobs

None
limit int | None

The maximum number of running training jobs at any time. If empty, then there is no limit.

None

PermissionLevel

The permission level of the user with respect to this resource

  • VIEW: The user can view and use the resource
  • ADMIN: The user can edit, delete and transfer ownership of the resource

Probe

The generated synthetic samples returned as a result of the probe.

Parameters:

Name Type Description Default
name str | None

The name of the table.

None
rows list[dict[str, Any]] | None
None

ProgressStatus

The status of a job or a step.

  • NEW: The job/step is being configured, and has not started yet
  • CONTINUE: The job/step is being configured, but has existing artefacts
  • ON_HOLD: The job/step has been started, but is kept on hold
  • QUEUED: The job/step has been started, and is awaiting resources to execute
  • IN_PROGRESS: The job/step is currently running
  • DONE: The job/step has finished successfully
  • FAILED: The job/step has failed
  • CANCELED: The job/step has been canceled

ProgressStep

The progress of a step.

Parameters:

Name Type Description Default
id str | None
None
model_label str | None

The unique label for the model, consisting of table name and a suffix for the model type. This will be empty for steps that are not related to a model.

None
compute_name str | None
None
restarts int | None

The number of previous restarts for the corresponding task.

None
step_code StepCode | None
None
start_date datetime | None

The UTC date and time when the job has started. If the job has not started yet, then this is None.

None
end_date datetime | None

The UTC date and time when the job has ended. If the job is still running, then this is None.

None
messages list[dict[str, Any]] | None
None
error_message str | None
None
progress ProgressValue | None
None
status ProgressStatus | None
None

ProgressValue

Parameters:

Name Type Description Default
value int | None
None
max int | None
None

RareCategoryReplacementMethod

Specifies how rare categories will be sampled. Only applicable if value protection has been enabled.

  • CONSTANT: Replace rare categories by a constant _RARE_ token.
  • SAMPLE: Replace rare categories by a sample from non-rare categories.

RebalancingConfig

Configure rebalancing.

Parameters:

Name Type Description Default
column str

The name of the column to be rebalanced. Only applicable for a subject table. Only applicable for categorical columns.

required
probabilities dict[str, float]

The target distribution of sample values. The keys are the categorical values, and the values are the probabilities.

required
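A hypothetical rebalancing configuration for a categorical label column, upsampling a minority category to 30% of the generated subject table (column and category names are placeholders):

```python
# Hypothetical RebalancingConfig values.
rebalancing = {
    "column": "transaction_label",
    "probabilities": {"FRAUD": 0.3, "LEGIT": 0.7},
}
```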

Similarity

Metrics regarding the similarity of the full joint distributions of samples within an embedding space.

  1. Cosine Similarity: The cosine similarity between the centroids of synthetic and training samples.
  2. Discriminator AUC: The AUC of a discriminative model to distinguish between synthetic and training samples.

The SentenceTransformer model all-MiniLM-L6-v2 is used to compute the embeddings of a string-ified representation of individual records. For sequential data, records that belong to the same group are concatenated. We then calculate the cosine similarity between the centroids of the provided datasets within the embedding space.

Again, we expect the similarity metrics to be as close as possible to 1, but not significantly higher than what is measured for the holdout data, as this would again indicate overfitting.

In addition, a discriminative ML model is trained to distinguish between training and synthetic samples. The ability of this model to distinguish between training and synthetic samples is measured by the AUC score. For synthetic data to be considered realistic, the AUC score should be close to 0.5, which indicates that the synthetic data is indistinguishable from the training data.
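The centroid cosine similarity described above can be sketched with plain Python, as a simplified stand-in for the embedding-space computation (real embeddings come from the SentenceTransformer model):

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

`cosine(centroid(training_embeddings), centroid(synthetic_embeddings))` then corresponds to cosine_similarity_training_synthetic.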

Parameters:

Name Type Description Default
cosine_similarity_training_synthetic float | None

Cosine similarity between training and synthetic centroids.

None
cosine_similarity_training_holdout float | None

Cosine similarity between training and holdout centroids. Serves as a reference for cosine_similarity_training_synthetic.

None
discriminator_auc_training_synthetic float | None

Cross-validated AUC of a discriminative model to distinguish between training and synthetic samples.

None
discriminator_auc_training_holdout float | None

Cross-validated AUC of a discriminative model to distinguish between training and holdout samples. Serves as a reference for discriminator_auc_training_synthetic.

None

SourceColumn

A column as part of a source table.

Parameters:

Name Type Description Default
id str

The unique identifier of a source column.

required
name str

The name of a source column.

required
included bool

If true, the column will be included in the training. If false, the column will be excluded from the training.

required
model_encoding_type ModelEncodingType
required
value_range SourceColumnValueRange | None
None

SourceColumnConfig

The configuration for a source column when creating a new generator.

Parameters:

Name Type Description Default
name str

The name of a source column.

required
model_encoding_type ModelEncodingType | None
<ModelEncodingType.auto: 'AUTO'>

SourceColumnValueRange

The (privacy-safe) range of values detected within a source column. These values can then be used as seed values for conditional generation. For CATEGORICAL and NUMERIC_DISCRETE encoding types, this is given as a list of unique values, sorted by popularity. For other NUMERIC and for DATETIME encoding types, this is given as a min and max value. Note that this property is only populated once the analysis step of the generator training has completed.

Parameters:

Name Type Description Default
min str | None

The minimum value of the column. For dates, this is represented in ISO format.

None
max str | None

The maximum value of the column. For dates, this is represented in ISO format.

None
values list[str] | None

The list of distinct values of the column. Limited to a maximum of 1000 values.

None
has_null bool | None

If true, a null value was detected within the column.

None
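The schema above implies a simple way to check a candidate seed value before using it for conditional generation: categorical-style ranges carry a `values` list, while numeric/datetime ranges carry string-typed `min`/`max` bounds. A hedged sketch (the helper name is ours, not the SDK's; inputs are plain dicts mirroring the fields above):

```python
def seed_value_ok(value_range: dict, value: "str | None") -> bool:
    """Check a candidate seed value against a SourceColumnValueRange-like
    dict. Hypothetical helper; field names follow the schema above."""
    if value is None:
        return bool(value_range.get("has_null"))
    values = value_range.get("values")
    if values is not None:  # CATEGORICAL / NUMERIC_DISCRETE
        return value in values
    # NUMERIC / DATETIME bounds are strings; ISO dates compare correctly
    # lexicographically, other numerics should be parsed before comparing.
    return value_range["min"] <= value <= value_range["max"]

cat = {"values": ["a", "b"], "has_null": False}
dt = {"min": "2020-01-01", "max": "2020-12-31", "has_null": True}
seed_value_ok(cat, "a")          # -> True
seed_value_ok(dt, "2021-06-01")  # -> False (outside the max bound)
```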

SourceForeignKey

Parameters:

Name Type Description Default
id str

The unique identifier of a foreign key.

required
column str | None

The column name of a foreign key.

None
referenced_table str

The table name of the referenced table. That table must have a primary key already defined.

required
is_context bool

If true, the foreign key is considered a context relation. Note that only one foreign key relation per table can be a context relation.

required

SourceForeignKeyConfig

Parameters:

Name Type Description Default
column str

The column name of a foreign key.

required
referenced_table str

The table name of the referenced table. That table must have a primary key already defined.

required
is_context bool | None

If true, the foreign key is considered a context relation. Note that only one foreign key relation per table can be a context relation.

None
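The "only one context relation per table" rule above can be made concrete with a small validation sketch over plain dicts shaped like SourceForeignKeyConfig (the table and column names are hypothetical; the check itself is ours, the server enforces its own):

```python
def check_foreign_keys(foreign_keys: list) -> None:
    """Enforce the rule stated above: at most one foreign key per
    table may be flagged as the context relation."""
    n_context = sum(1 for fk in foreign_keys if fk.get("is_context"))
    if n_context > 1:
        raise ValueError(f"{n_context} context relations; at most 1 allowed")

# Hypothetical 'orders' table referencing 'customers' (context) and 'products'.
fks = [
    {"column": "customer_id", "referenced_table": "customers", "is_context": True},
    {"column": "product_id", "referenced_table": "products", "is_context": False},
]
check_foreign_keys(fks)  # passes silently
```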

SourceTable

A table as part of a generator.

Parameters:

Name Type Description Default
id str

The unique identifier of a source table.

required
source_connector BaseResource | None
None
location str | None

The location of a source table. Together with the source connector it uniquely identifies a source, and samples data from there.

None
name str

The name of a source table. It must be unique within a generator.

required
primary_key str | None

The column name of the primary key.

None
columns list[SourceColumn]

The columns of this generator table.

required
foreign_keys list[SourceForeignKey] | None

The foreign keys of a table.

None
model_metrics ModelMetrics | None
None
language_model_metrics ModelMetrics | None
None
model_configuration ModelConfiguration | None
None
language_model_configuration ModelConfiguration | None
None
total_rows int | None

The total number of rows in the source table while fetching data for training.

None

SourceTableAddConfig

The configuration for adding a new source table to a generator.

Parameters:

Name Type Description Default
source_connector_id str

The unique identifier of a connector.

required
location str

The location of a source table. Together with the source connector it uniquely identifies a source, and samples data from there.

required
name str | None

The name of a source table. It must be unique within a generator.

None
include_children bool | None

If true, all tables that are referenced by foreign keys will be included. If false, only the selected table will be included.

None
model_configuration ModelConfiguration | None
None
language_model_configuration ModelConfiguration | None
None

SourceTableConfig

The configuration for a source table when creating a new generator.

Parameters:

Name Type Description Default
name str

The name of a source table. It must be unique within a generator.

required
source_connector_id str | None

The unique identifier of a connector.

None
location str | None

The location of a source table. Together with the source connector it uniquely identifies a source, and samples data from there.

None
data str | None

The base64-encoded string derived from a Parquet file containing the specified source table.

None
model_configuration ModelConfiguration | None
None
language_model_configuration ModelConfiguration | None
None
primary_key str | None

The column name of the primary key.

None
foreign_keys list[SourceForeignKeyConfig] | None

The foreign key configurations of this table.

None
columns list[SourceColumnConfig] | None

The column configurations of this table.

None
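The `data` field above expects the base64-encoded bytes of a Parquet file. A minimal sketch of preparing such a config dict from a file on disk (the helper name and the `census.parquet` path are illustrative; writing the Parquet file itself would typically be done with pandas/pyarrow beforehand):

```python
import base64
from pathlib import Path


def table_config_from_parquet(name, path) -> dict:
    """Build a SourceTableConfig-like dict with `data` set to the
    base64-encoded bytes of an existing Parquet file. Illustrative
    helper; a real call would pass this dict to the SDK."""
    raw = Path(path).read_bytes()
    return {"name": name, "data": base64.b64encode(raw).decode("ascii")}


# config = table_config_from_parquet("census", "census.parquet")
```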

SyntheticDataset

A synthetic dataset is created based on a trained generator.

It consists of synthetic samples, as well as a quality assurance report.

Parameters:

Name Type Description Default
id str

The unique identifier of a synthetic dataset.

required
generator BaseResource | None
None
metadata Metadata
required
name str

The name of a synthetic dataset.

required
description str | None

The description of a synthetic dataset.

None
generation_status ProgressStatus
required
generation_time datetime | None

The UTC date and time when the generation has finished.

None
tables list[SyntheticTable] | None

The tables of this synthetic dataset.

None
delivery SyntheticDatasetDelivery | None
None
accuracy float | None

The overall accuracy of the trained generator. This is the average of the overall accuracy scores of all trained models.

None
usage SyntheticDatasetUsage | None
None
generation Any | None
None
Generation
cancel
cancel()

Cancel the generation process.

progress
progress()

Retrieve the progress of the generation process.

Returns:

Name Type Description
JobProgress JobProgress

The progress of the generation process.

start
start()

Start the generation process.

wait
wait(progress_bar=True, interval=2)

Poll the generation progress and wait until the process is complete.

Parameters:

Name Type Description Default
progress_bar bool

If true, displays a progress bar.

True
interval float

Interval in seconds to poll the job progress.

2
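Under the hood, wait() amounts to a polling loop over the job progress. A generic stdlib-only sketch of that pattern (the `get_status` callable stands in for the SDK's progress call, and the terminal state names here are assumptions, not the SDK's exact enum):

```python
import time


def wait_until_done(get_status, interval: float = 2, timeout: float = 3600) -> str:
    """Poll get_status() every `interval` seconds until it reports a
    terminal state, in the spirit of wait(). Assumed terminal states:
    DONE, FAILED, CANCELED."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status in ("DONE", "FAILED", "CANCELED"):
            return status
        time.sleep(interval)
    raise TimeoutError("generation did not finish in time")
```

The real wait() additionally renders a progress bar when progress_bar is true.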
config
config()

Retrieve writable synthetic dataset properties.

Returns:

Name Type Description
SyntheticDatasetConfig SyntheticDatasetConfig

The synthetic dataset properties as a configuration object.

data
data(return_type='auto')

Download the synthetic dataset and return it as a pandas DataFrame, or as a dictionary of pandas DataFrames for multi-table datasets.

Parameters:

Name Type Description Default
return_type Literal['auto', 'dict']

The format of the returned data.

'auto'

Returns:

Type Description
DataFrame | dict[str, DataFrame]

The synthetic dataset as a pandas DataFrame (single table) or as a dictionary of pandas DataFrames keyed by table name.

delete
delete()

Delete the synthetic dataset.

Returns:

Type Description
None

None

download
download(format='PARQUET', file_path=None)

Download the synthetic dataset and save it to a file.

Parameters:

Name Type Description Default
format SyntheticDatasetFormat

The format of the synthetic dataset.

'PARQUET'
file_path str | Path | None

The file path to save the synthetic dataset.

None

Returns:

Type Description
Path

The path to the saved file.

update
update(name=None, description=None, delivery=None)

Update a synthetic dataset with specific parameters.

Parameters:

Name Type Description Default
name str | None

The name of the synthetic dataset.

None
description str | None

The description of the synthetic dataset.

None
delivery SyntheticDatasetDelivery | None

The delivery configuration for the synthetic dataset.

None

SyntheticDatasetConfig

The configuration for creating a new synthetic dataset.

Parameters:

Name Type Description Default
generator_id str | None

The unique identifier of a generator.

None
name str | None

The name of a synthetic dataset.

None
description str | None

The description of a synthetic dataset.

None
tables list[SyntheticTableConfig] | None
None
delivery SyntheticDatasetDelivery | None
None

SyntheticDatasetDelivery

Parameters:

Name Type Description Default
overwrite_tables bool

If true, tables in the destination will be overwritten. If false, and any tables already exist, the delivery will fail.

required
destination_connector_id str

The unique identifier of a connector.

required
location str

The location for the destination connector.

required

SyntheticDatasetListItem

Essential synthetic dataset details for listings.

Parameters:

Name Type Description Default
id str

The unique identifier of a synthetic dataset.

required
metadata Metadata
required
name str

The name of a synthetic dataset.

required
description str | None

The description of a synthetic dataset.

None
generation_status ProgressStatus
required
generation_time datetime | None

The UTC date and time when the generation has finished.

None
usage SyntheticDatasetUsage | None
None

SyntheticDatasetUsage

Usage statistics of a synthetic dataset.

Parameters:

Name Type Description Default
total_datapoints int | None

The number of datapoints in the synthetic dataset.

None
total_credits float | None

The number of credits used for the synthetic dataset.

None
total_compute_time int | None

The total compute time in seconds used for generating this synthetic dataset. This is the sum of the compute time across all tasks of the generation job.

None
no_of_shares int | None

Number of shares of this synthetic dataset.

None
no_of_likes int | None

Number of likes of this synthetic dataset.

None

SyntheticProbeConfig

The configuration for probing for new synthetic samples.

Parameters:

Name Type Description Default
generator_id str | None

The unique identifier of a generator.

None
tables list[SyntheticTableConfig] | None
None

SyntheticTable

A synthetic table that will be generated.

Parameters:

Name Type Description Default
id str | None

The unique identifier of a synthetic table.

None
name str

The name of a synthetic table. It matches the name of the corresponding source table.

required
configuration SyntheticTableConfiguration | None
None
model_metrics ModelMetrics | None
None
language_model_metrics ModelMetrics | None
None
foreign_keys list[ForeignKey] | None

The foreign keys of this table.

None
total_rows int | None

The total number of rows for that table in the generated synthetic dataset.

None
total_datapoints int | None

The total number of datapoints for that table in the generated synthetic dataset.

None
source_table_total_rows int | None

The total number of rows in the source table while fetching data for training.

None

SyntheticTableConfig

The configuration for a synthetic table when creating a new synthetic dataset.

Parameters:

Name Type Description Default
name str

The name of a synthetic table. This matches the name of a corresponding SourceTable.

required
configuration SyntheticTableConfiguration | None
None

SyntheticTableConfiguration

The sample configuration for a synthetic table.

Parameters:

Name Type Description Default
sample_size int | None

Number of generated samples. Only applicable for subject tables. If neither a sample size nor seed data is provided, the default behavior for Synthetic Datasets is to generate the same number of samples as the original, and the default behavior for Synthetic Probes is to generate one subject only.

None
sample_seed_connector_id str | None

The connector id of the seed data for conditional generation. Only applicable for subject tables.

None
sample_seed_dict str | None

The base64-encoded string derived from a JSON Lines file containing the specified sample seed data.

None
sample_seed_data str | None

The base64-encoded string derived from a Parquet file containing the specified sample seed data.

None
sampling_temperature float | None

The temperature used for sampling. Higher values yield more diverse samples; lower values yield more conservative ones.

None
sampling_top_p float | None

The top-p (nucleus sampling) probability used for sampling.

None
rebalancing RebalancingConfig | None
None
imputation ImputationConfig | None
None
fairness FairnessConfig | None
None
tabular_compute str | None
None
language_compute str | None
None
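The sample_seed_dict field above expects base64-encoded JSON Lines. A minimal sketch of preparing such a seed payload for conditional generation (the helper name and table/column names are ours, not the SDK's):

```python
import base64
import json


def encode_seed_jsonl(rows: list) -> str:
    """Base64-encode seed rows as JSON Lines, the format expected by
    sample_seed_dict above. Illustrative helper."""
    jsonl = "\n".join(json.dumps(row) for row in rows)
    return base64.b64encode(jsonl.encode("utf-8")).decode("ascii")


# Hypothetical subject table 'customers', seeded on a 'country' column.
config = {
    "name": "customers",
    "configuration": {
        "sample_seed_dict": encode_seed_jsonl([{"country": "AT"}, {"country": "DE"}]),
        "sampling_temperature": 1.0,
    },
}
```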

User

A user of the service.

Parameters:

Name Type Description Default
id str | None

The unique identifier of a user.

None
first_name str | None

First name of a user.

None
last_name str | None

Last name of a user.

None
email str | None

The email of a user.

None

UserSettingsAssistantUpdateConfig

Parameters:

Name Type Description Default
about_user_message str | None

Instructions about the user that the Assistant should take into account in order to provide better responses.

None
about_model_message str | None

Instructions on how the Assistant should respond.

None

UserSettingsProfileUpdateConfig

Parameters:

Name Type Description Default
first_name str | None

First name of a user.

None
last_name str | None

Last name of a user.

None

UserSettingsUpdateConfig

The configuration for updating user settings.

Parameters:

Name Type Description Default
profile UserSettingsProfileUpdateConfig | None
None
assistant UserSettingsAssistantUpdateConfig | None
None

UserUsage

Parameters:

Name Type Description Default
credits Credits | None
None
parallel_training_jobs ParallelTrainingJobs | None
None
parallel_generation_jobs ParallelGenerationJobs | None
None