Schema References for mostlyai.domain
This module is auto-generated to represent the pydantic-based classes of the schema defined in the Public API.
mostlyai.domain ¶
AboutService ¶
General information about the service.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `version` | `str \| None` | The version number of the service. | `None` |
| `assistant` | `bool \| None` | A flag indicating if the assistant is enabled. | `None` |
Accuracy ¶
Metrics regarding the accuracy of synthetic data, measured as the closeness of discretized lower dimensional marginal distributions.
- Univariate Accuracy: The accuracy of the univariate distributions for all target columns.
- Bivariate Accuracy: The accuracy of all pair-wise distributions for target columns, as well as for target columns with respect to the context columns.
- Coherence Accuracy: The accuracy of the auto-correlation for all target columns.
Accuracy is defined as 100% minus the Total Variation Distance (TVD), where TVD is half the sum of the absolute differences of the relative frequencies of the corresponding distributions.
These accuracies are calculated for all discretized univariate and bivariate distributions and, in the case of sequential data, for all coherence distributions. Overall metrics are then calculated as the average across these accuracies.
All metrics can be compared against a theoretical maximum accuracy, which is calculated for a same-sized holdout. The accuracy metrics should be as close as possible to the theoretical maximum, but not significantly higher, as that would indicate overfitting.
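As an illustration of the definition above, here is a minimal, self-contained sketch (not part of the SDK) that computes the accuracy of a single discretized distribution as 100% minus the TVD:

```python
def tvd(p: dict, q: dict) -> float:
    # Total Variation Distance: half the sum of the absolute differences
    # of the relative frequencies of the corresponding distributions.
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)

def accuracy(p: dict, q: dict) -> float:
    # Accuracy of one discretized marginal distribution: 100% - TVD.
    return 1.0 - tvd(p, q)

# Relative frequencies of a categorical column (illustrative numbers).
original = {"a": 0.5, "b": 0.3, "c": 0.2}
synthetic = {"a": 0.4, "b": 0.4, "c": 0.2}
print(round(accuracy(original, synthetic), 3))  # 0.9
```

The overall metric reported by the platform is then the average of such per-distribution accuracies.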
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `overall` | `float \| None` | Overall accuracy of synthetic data, averaged across univariate, bivariate, and coherence. | `None` |
| `univariate` | `float \| None` | Average accuracy of discretized univariate distributions. | `None` |
| `bivariate` | `float \| None` | Average accuracy of discretized bivariate distributions. | `None` |
| `coherence` | `float \| None` | Average accuracy of discretized coherence distributions. Only applicable for sequential data. | `None` |
| `overall_max` | `float \| None` | Expected overall accuracy of a same-sized holdout. Serves as a reference for `overall`. | `None` |
| `univariate_max` | `float \| None` | Expected univariate accuracy of a same-sized holdout. Serves as a reference for `univariate`. | `None` |
| `bivariate_max` | `float \| None` | Expected bivariate accuracy of a same-sized holdout. Serves as a reference for `bivariate`. | `None` |
| `coherence_max` | `float \| None` | Expected coherence accuracy of a same-sized holdout. Serves as a reference for `coherence`. | `None` |
BaseResource ¶
A set of common properties across resources.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `id` | `str \| None` | The unique identifier of the entity. | `None` |
| `name` | `str \| None` | The name of the entity. | `None` |
| `uri` | `str \| None` | The API service endpoint of the entity. | `None` |
| `current_user_permission_level` | `PermissionLevel \| None` | | `None` |
| `current_user_like_status` | `bool \| None` | A boolean indicating whether the user has liked the entity or not. | `None` |
Compute ¶
A compute resource for executing tasks.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `id` | `str \| None` | | `None` |
| `name` | `str \| None` | | `None` |
| `type` | `ComputeType \| None` | | `None` |
| `config` | `dict[str, Any] \| None` | | `None` |
| `secrets` | `dict[str, Any] \| None` | | `None` |
| `resources` | `ComputeResources \| None` | | `None` |
| `order_index` | `int \| None` | The index for determining the sort order when listing computes. | `None` |
ComputeConfig ¶
The configuration for creating a new compute resource.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str \| None` | | `None` |
| `type` | `ComputeType \| None` | | `None` |
| `resources` | `ComputeResources \| None` | | `None` |
| `config` | `dict[str, Any] \| None` | | `None` |
| `secrets` | `dict[str, Any] \| None` | | `None` |
| `order_index` | `int \| None` | The index for determining the sort order when listing computes. | `None` |
ComputeListItem ¶
Essential compute details for listings.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `id` | `str \| None` | | `None` |
| `type` | `ComputeType \| None` | | `None` |
| `name` | `str \| None` | | `None` |
| `resources` | `ComputeResources \| None` | | `None` |
ComputeResources ¶
A set of available hardware resources for a compute resource.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `cpus` | `int \| None` | The number of CPU cores. | `None` |
| `memory` | `float \| None` | The amount of memory in GB. | `None` |
| `gpus` | `int \| None` | The number of GPUs. | `0` |
| `gpu_memory` | `float \| None` | The amount of GPU memory in GB. | `0` |
ComputeType ¶
The type of compute.
Connector ¶
A connector is a connection to a data source or a data destination.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `id` | `str` | The unique identifier of a connector. | required |
| `name` | `str` | The name of a connector. | required |
| `type` | `ConnectorType` | | required |
| `access_type` | `ConnectorAccessType` | | required |
| `config` | `dict[str, Any] \| None` | | `None` |
| `secrets` | `dict[str, str] \| None` | | `None` |
| `ssl` | `dict[str, str] \| None` | | `None` |
| `metadata` | `Metadata \| None` | | `None` |
| `usage` | `ConnectorUsage \| None` | | `None` |
| `table_id` | `str \| None` | Optional. ID of a source table or a synthetic table that this connector belongs to. If not set, then this connector is managed independently of any generator or synthetic dataset. | `None` |
locations ¶
List connector locations.
List the available databases, schemas, tables, or folders for a connector.
For storage connectors, this returns the list of folders and files at the root level, or at the prefix level if a prefix is provided.
For DB connectors, this returns the list of schemas (or databases, for DBs without schemas), or the list of tables if a prefix is provided.
The formats of the locations are:
- Cloud storage:
  - `AZURE_STORAGE`: `container/path`
  - `GOOGLE_CLOUD_STORAGE`: `bucket/path`
  - `S3_STORAGE`: `bucket/path`
- Database:
  - `BIGQUERY`: `dataset.table`
  - `DATABRICKS`: `schema.table`
  - `HIVE`: `database.table`
  - `MARIADB`: `database.table`
  - `MSSQL`: `schema.table`
  - `MYSQL`: `database.table`
  - `ORACLE`: `schema.table`
  - `POSTGRES`: `schema.table`
  - `SNOWFLAKE`: `schema.table`
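Database location strings follow the dotted convention shown above. A small sketch (not part of the SDK) for splitting such a string into its parts:

```python
def split_db_location(location: str):
    # Split "schema.table" (or "database.table") into its two parts;
    # a bare "schema" has no table component yet.
    schema, _, table = location.partition(".")
    return schema, table or None

print(split_db_location("public.users"))  # ('public', 'users')
print(split_db_location("public"))        # ('public', None)
```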
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `prefix` | `str` | The prefix to filter the results by. | `''` |

Returns:

| Name | Type | Description |
|---|---|---|
| `list` | `list` | A list of locations (schemas, databases, directories, etc.). |
schema ¶
Retrieve the schema of the table at a connector location.
Please refer to locations()
for the format of the location.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `location` | `str` | The location of the table. | required |

Returns:

| Type | Description |
|---|---|
| `list[dict[str, Any]]` | The retrieved schema. |
update ¶
Update a connector with specific parameters.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str \| None` | The name of the connector. | `None` |
| `config` | `dict[str, Any]` | Connector configuration. | `None` |
| `secrets` | `dict[str, str]` | Secret values for the connector. | `None` |
| `ssl` | `dict[str, str]` | SSL configuration for the connector. | `None` |
| `test_connection` | `bool \| None` | If true, validates the connection before saving. | `True` |
ConnectorAccessType ¶
The access type of a connector. Either `SOURCE` or `DESTINATION`.
ConnectorConfig ¶
The structures of the config, secrets and ssl parameters depend on the connector type.
- Cloud storage:

```yaml
- type: AZURE_STORAGE
  config:
    accountName: string
    clientId: string (required for auth via service principal)
    tenantId: string (required for auth via service principal)
  secrets:
    accountKey: string (required for regular auth)
    clientSecret: string (required for auth via service principal)
- type: GOOGLE_CLOUD_STORAGE
  secrets:
    keyFile: string
- type: S3_STORAGE
  config:
    accessKey: string
    endpointUrl: string (only needed for S3-compatible storage services other than AWS)
    sslEnabled: boolean, default: false
  secrets:
    secretKey: string
  ssl:
    caCertificate: base64-encoded string
```

- Database:

```yaml
- type: BIGQUERY
  secrets:
    keyFile: string
- type: DATABRICKS
  config:
    host: string
    httpPath: string
    catalog: string
    clientId: string (required for auth via service principal)
    tenantId: string (required for auth via service principal)
  secrets:
    accessToken: string (required for regular auth)
    clientSecret: string (required for auth via service principal)
- type: HIVE
  config:
    host: string
    port: integer, default: 10000
    username: string (required for regular auth)
    kerberosEnabled: boolean, default: false
    kerberosServicePrincipal: string (required if kerberosEnabled)
    kerberosClientPrincipal: string (optional if kerberosEnabled)
    kerberosKrb5Conf: string (required if kerberosEnabled)
    sslEnabled: boolean, default: false
  secrets:
    password: string (required for regular auth)
    kerberosKeytab: base64-encoded string (required if kerberosEnabled)
  ssl:
    caCertificate: base64-encoded string
- type: MARIADB
  config:
    host: string
    port: integer, default: 3306
    username: string
  secrets:
    password: string
- type: MSSQL
  config:
    host: string
    port: integer, default: 1433
    username: string
    database: string
  secrets:
    password: string
- type: MYSQL
  config:
    host: string
    port: integer, default: 3306
    username: string
  secrets:
    password: string
- type: ORACLE
  config:
    host: string
    port: integer, default: 1521
    username: string
    connectionType: enum {SID, SERVICE_NAME}, default: SID
    database: string, default: ORCL
  secrets:
    password: string
- type: POSTGRES
  config:
    host: string
    port: integer, default: 5432
    username: string
    database: string
    sslEnabled: boolean, default: false
  secrets:
    password: string
  ssl:
    rootCertificate: base64-encoded string
    sslCertificate: base64-encoded string
    sslCertificateKey: base64-encoded string
- type: SNOWFLAKE
  config:
    account: string
    username: string
    warehouse: string, default: COMPUTE_WH
    database: string
  secrets:
    password: string
```
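Putting the pieces together, a `POSTGRES` source connector configuration might look as follows. All keys are taken from the structure above; the host, user, and database values are placeholders:

```python
postgres_connector = {
    "name": "My Postgres",          # placeholder name
    "type": "POSTGRES",
    "access_type": "SOURCE",
    "config": {
        "host": "db.example.com",   # placeholder
        "port": 5432,               # default per the structure above
        "username": "analyst",      # placeholder
        "database": "prod",         # placeholder
        "sslEnabled": False,
    },
    "secrets": {
        "password": "***",          # placeholder
    },
}
```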
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str \| None` | The name of a connector. | `None` |
| `type` | `ConnectorType` | | required |
| `access_type` | `ConnectorAccessType \| None` | | `<ConnectorAccessType.source: 'SOURCE'>` |
| `config` | `dict[str, Any] \| None` | | `None` |
| `secrets` | `dict[str, str] \| None` | | `None` |
| `ssl` | `dict[str, str] \| None` | | `None` |
ConnectorListItem ¶
Essential connector details for listings.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `id` | `str` | The unique identifier of a connector. | required |
| `name` | `str` | The name of a connector. | required |
| `type` | `ConnectorType` | | required |
| `access_type` | `ConnectorAccessType` | | required |
| `metadata` | `Metadata` | | required |
| `usage` | `ConnectorUsage \| None` | | `None` |
ConnectorType ¶
The type of a connector.
The type determines the structure of the config, secrets and ssl parameters.
- `MYSQL`: MySQL database
- `POSTGRES`: PostgreSQL database
- `MSSQL`: Microsoft SQL Server database
- `ORACLE`: Oracle database
- `MARIADB`: MariaDB database
- `SNOWFLAKE`: Snowflake cloud data platform
- `BIGQUERY`: Google BigQuery cloud data warehouse
- `HIVE`: Apache Hive database
- `DATABRICKS`: Databricks cloud data platform
- `AZURE_STORAGE`: Azure Blob Storage
- `GOOGLE_CLOUD_STORAGE`: Google Cloud Storage
- `S3_STORAGE`: Amazon S3 Storage
- `FILE_UPLOAD`: File upload
ConnectorUsage ¶
Usage statistics of a connector.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `no_of_shares` | `int \| None` | Number of shares of this connector. | `None` |
| `no_of_generators` | `int \| None` | Number of generators using this connector. | `None` |
Credits ¶
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `current` | `float \| None` | The credit balance for the current time period. | `None` |
| `limit` | `float \| None` | The credit limit for the current time period. If empty, then there is no limit. | `None` |
| `period_start` | `datetime \| None` | The UTC date and time when the current time period started. | `None` |
| `period_end` | `datetime \| None` | The UTC date and time when the current time period ends. | `None` |
CurrentUser ¶
Information on the current user.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `id` | `str \| None` | The unique identifier of a user. | `None` |
| `first_name` | `str \| None` | First name of a user. | `None` |
| `last_name` | `str \| None` | Last name of a user. | `None` |
| `email` | `str \| None` | The email of a user. | `None` |
| `settings` | `dict[str, Any] \| None` | | `None` |
| `usage` | `UserUsage \| None` | | `None` |
| `unread_notifications` | `int \| None` | Number of unread notifications for the user. | `None` |
DifferentialPrivacyConfig ¶
The optional differential privacy configuration for training the model. If not provided, then no differential privacy will be applied.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `max_epsilon` | `float \| None` | Specifies the maximum allowable epsilon value. If the training process exceeds this threshold, it will be terminated early. Only model checkpoints with epsilon values below this limit will be retained. If not provided, the training will proceed without early termination based on epsilon constraints. | `None` |
| `noise_multiplier` | `float \| None` | The ratio of the standard deviation of the Gaussian noise to the L2-sensitivity of the function to which the noise is added (how much noise to add). | `1.5` |
| `max_grad_norm` | `float \| None` | The maximum norm of the per-sample gradients for training the model with differential privacy. | `1.0` |
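For illustration, a configuration that combines the defaults above with an explicit epsilon budget (the value 5.0 is an arbitrary example, not a recommendation):

```python
dp_config = {
    "max_epsilon": 5.0,       # terminate training early once epsilon exceeds this
    "noise_multiplier": 1.5,  # default from the table above
    "max_grad_norm": 1.0,     # default from the table above
}
```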
Distances ¶
Metrics regarding the nearest neighbor distances between training, holdout, and synthetic samples in an embedding space. Useful for assessing the novelty / privacy of synthetic data.
The provided data is first down-sampled, so that the number of samples matches across all datasets. Note that for an optimal sensitivity of this privacy assessment it is recommended to use a 50/50 split between training and holdout data, and then generate synthetic data of the same size.
The embeddings of these samples are then computed, and the L2 nearest neighbor distances are calculated for each synthetic sample to the training and holdout samples. Based on these nearest neighbor distances, the following metrics are calculated:
- Identical Match Share (IMS): The share of synthetic samples that are identical to a training or holdout sample.
- Distance to Closest Record (DCR): The average distance of synthetic to training or holdout samples.
For privacy-safe synthetic data we expect to see about as many identical matches, and about the same distances for synthetic samples to training, as we see for synthetic samples to holdout.
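The DCR comparison can be sketched with a small, self-contained example (not the platform's implementation, which operates on learned embeddings):

```python
def nn_distance(sample, reference):
    # L2 distance from one embedded sample to its nearest neighbor in `reference`.
    return min(
        sum((a - b) ** 2 for a, b in zip(sample, ref)) ** 0.5
        for ref in reference
    )

def dcr_share(synthetic, training, holdout):
    # Share of synthetic samples closer to a training sample than to a
    # holdout sample; for privacy-safe data this should be around 50%.
    closer = sum(
        1 for s in synthetic
        if nn_distance(s, training) < nn_distance(s, holdout)
    )
    return closer / len(synthetic)

synthetic = [(0.0, 0.0), (1.0, 1.0)]
training = [(0.1, 0.0)]
holdout = [(0.9, 1.0)]
print(dcr_share(synthetic, training, holdout))  # 0.5
```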
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `ims_training` | `float \| None` | Share of synthetic samples that are identical to a training sample. | `None` |
| `ims_holdout` | `float \| None` | Share of synthetic samples that are identical to a holdout sample. Serves as a reference for `ims_training`. | `None` |
| `dcr_training` | `float \| None` | Average L2 nearest-neighbor distance between synthetic and training samples. | `None` |
| `dcr_holdout` | `float \| None` | Average L2 nearest-neighbor distance between synthetic and holdout samples. Serves as a reference for `dcr_training`. | `None` |
| `dcr_share` | `float \| None` | Share of synthetic samples that are closer to a training sample than to a holdout sample. This should not be significantly larger than 50%. | `None` |
FairnessConfig ¶
Configure a fairness objective for the table. Only applicable for a subject table. The generated synthetic data will maintain robust statistical parity between the target column and the specified sensitive columns. All these columns must be categorical.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `target_column` | `str` | The name of the target column. | required |
| `sensitive_columns` | `list[str]` | The names of the sensitive columns. | required |
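A minimal example configuration; the column names are hypothetical and must refer to categorical columns of the subject table:

```python
fairness_config = {
    "target_column": "income_bracket",             # hypothetical categorical column
    "sensitive_columns": ["gender", "age_group"],  # hypothetical categorical columns
}
```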
ForeignKey ¶
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `column` | `str` | The column name of a foreign key. | required |
| `referenced_table` | `str` | The table name of the referenced table. That table must have a primary key already defined. | required |
| `is_context` | `bool` | If true, then the foreign key will be considered as a context relation. Note that only one foreign key relation per table can be a context relation. | required |
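For illustration, two foreign keys of a hypothetical `orders` table; at most one of them may be flagged as the context relation:

```python
order_foreign_keys = [
    {"column": "customer_id", "referenced_table": "customers", "is_context": True},
    {"column": "store_id", "referenced_table": "stores", "is_context": False},
]
# Only one foreign key relation per table can be a context relation.
assert sum(fk["is_context"] for fk in order_foreign_keys) <= 1
```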
Generator ¶
A generator is a set of models that can generate synthetic data.
The generator can be trained on one or more source tables. A quality assurance report is generated for each model.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `id` | `str` | The unique identifier of a generator. | required |
| `name` | `str \| None` | The name of a generator. | `None` |
| `description` | `str \| None` | The description of a generator. | `None` |
| `training_status` | `ProgressStatus` | | required |
| `training_time` | `datetime \| None` | The UTC date and time when the training has finished. | `None` |
| `usage` | `GeneratorUsage \| None` | | `None` |
| `metadata` | `Metadata` | | required |
| `accuracy` | `float \| None` | The overall accuracy of the trained generator. This is the average of the overall accuracy scores of all trained models. | `None` |
| `tables` | `list[SourceTable] \| None` | The tables of this generator. | `None` |
| `training` | `Any \| None` | | `None` |
Training ¶
progress ¶
Retrieve job progress of training.
Returns:

| Name | Type | Description |
|---|---|---|
| `JobProgress` | `JobProgress` | The job progress of the training process. |
wait ¶
Poll training progress and loop until training has completed.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `progress_bar` | `bool` | If true, displays the progress bar. | `True` |
| `interval` | `float` | The interval in seconds to poll the job progress. | `2` |
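The documented polling behavior can be sketched generically; this is not the SDK implementation, just the loop it describes:

```python
import time

def wait_until_done(get_status, interval=2.0, timeout=600.0):
    # Poll `get_status` every `interval` seconds until it reports a
    # terminal ProgressStatus, mirroring the documented wait() behavior.
    terminal = {"DONE", "FAILED", "CANCELED"}
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status in terminal:
            return status
        time.sleep(interval)
    raise TimeoutError("job did not reach a terminal status in time")

statuses = iter(["QUEUED", "IN_PROGRESS", "DONE"])
print(wait_until_done(lambda: next(statuses), interval=0.0))  # DONE
```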
clone ¶
Clone the generator.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `training_status` | `Literal['NEW', 'CONTINUE']` | The training status of the cloned generator. | `'NEW'` |

Returns:

| Name | Type | Description |
|---|---|---|
| `Generator` | `Generator` | The cloned generator object. |
config ¶
Retrieve writable generator properties.
Returns:

| Name | Type | Description |
|---|---|---|
| `GeneratorConfig` | `GeneratorConfig` | The generator properties as a configuration object. |
export_to_file ¶
Export generator and save to file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `str \| Path \| None` | The file path to save the generator. | `None` |

Returns:

| Type | Description |
|---|---|
| `Path` | The path to the saved file. |
update ¶
Update a generator with specific parameters.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str \| None` | The name of the generator. | `None` |
| `description` | `str \| None` | The description of the generator. | `None` |
|
GeneratorCloneTrainingStatus ¶
The training status of the new generator. The available options are:
- `NEW`: The new generator will re-use existing data and model configurations.
- `CONTINUE`: The new generator will re-use existing data and model configurations, as well as model weights.
GeneratorConfig ¶
The configuration for creating a new generator.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str \| None` | The name of a generator. | `None` |
| `description` | `str \| None` | The description of a generator. | `None` |
| `tables` | `list[SourceTableConfig] \| None` | The tables of a generator. | `None` |
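A minimal configuration sketch; the generator name and table entry are hypothetical, and each table entry is a `SourceTableConfig`:

```python
generator_config = {
    "name": "census-generator",                     # hypothetical name
    "description": "Trained on the census tables",  # hypothetical description
    "tables": [
        {"name": "persons"},  # minimal SourceTableConfig sketch
    ],
}
```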
GeneratorImportFromFileConfig ¶
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file` | `bytes` | | required |
GeneratorListItem ¶
Essential generator details for listings.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `id` | `str` | The unique identifier of a generator. | required |
| `name` | `str \| None` | The name of a generator. | `None` |
| `description` | `str \| None` | The description of a generator. | `None` |
| `training_status` | `ProgressStatus` | | required |
| `training_time` | `datetime \| None` | The UTC date and time when the training has finished. | `None` |
| `usage` | `GeneratorUsage \| None` | | `None` |
| `metadata` | `Metadata` | | required |
GeneratorUsage ¶
Usage statistics of a generator.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `total_datapoints` | `int \| None` | The total number of datapoints generated by this generator. | `None` |
| `total_compute_time` | `int \| None` | The total compute time in seconds used for training this generator. This is the sum of the compute time of all trained tasks. | `None` |
| `no_of_synthetic_datasets` | `int \| None` | Number of synthetic datasets generated by this generator. | `None` |
| `no_of_shares` | `int \| None` | Number of shares of this generator. | `None` |
| `no_of_likes` | `int \| None` | Number of likes of this generator. | `None` |
|
ImputationConfig ¶
Configure imputation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `columns` | `list[str]` | The names of the columns to be imputed. Imputed columns will suppress the sampling of NULL values. | required |
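For example (hypothetical column names):

```python
imputation_config = {
    "columns": ["age", "income"],  # NULL values will not be sampled for these
}
```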
JobProgress ¶
The progress of a job.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `id` | `str \| None` | | `None` |
| `start_date` | `datetime \| None` | The UTC date and time when the job has started. If the job has not started yet, then this is None. | `None` |
| `end_date` | `datetime \| None` | The UTC date and time when the job has ended. If the job is still running, then this is None. | `None` |
| `progress` | `ProgressValue \| None` | | `None` |
| `status` | `ProgressStatus \| None` | | `None` |
| `steps` | `list[ProgressStep] \| None` | | `None` |
|
Metadata ¶
The metadata of a resource.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `created_at` | `datetime \| None` | The UTC date and time when the resource has been created. | `None` |
| `owner_id` | `str \| None` | The unique identifier of the owner of the entity. | `None` |
| `owner_name` | `str \| None` | The name of the owner of the entity. | `None` |
| `current_user_permission_level` | `PermissionLevel \| None` | | `None` |
| `current_user_like_status` | `bool \| None` | A boolean indicating whether the user has liked the entity or not. | `None` |
| `short_lived_file_token` | `str \| None` | An auto-generated short-lived file token. | `None` |
ModelConfiguration ¶
The training configuration for the model
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `str \| None` | The model to be used for training. | `None` |
| `max_sample_size` | `int \| None` | The maximum number of samples to consider for training. If not provided, then all available samples will be taken. | `None` |
| `batch_size` | `int \| None` | The batch size used for training the model. If not provided, the batch size will be chosen automatically. | `None` |
| `max_training_time` | `float \| None` | The maximum number of minutes to train the model. | `10` |
| `max_epochs` | `float \| None` | The maximum number of epochs to train the model. | `100` |
| `max_sequence_window` | `int \| None` | The maximum sequence window to consider for training. Only applicable for TABULAR models. | `100` |
| `enable_flexible_generation` | `bool \| None` | If true, then the trained generator can be used for rebalancing and imputation. Only applicable for TABULAR models. | `True` |
| `value_protection` | `bool \| None` | Defines if Rare Category, Extreme Value, or Sequence Length protection will be applied. | `True` |
| `rare_category_replacement_method` | `RareCategoryReplacementMethod \| None` | Specifies how rare categories will be sampled. Only applicable if value protection has been enabled. | `<RareCategoryReplacementMethod.constant: 'CONSTANT'>` |
| `differential_privacy` | `DifferentialPrivacyConfig \| None` | | `None` |
| `compute` | `str \| None` | | `None` |
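A configuration sketch combining the defaults above with an explicit sample cap (the sample size is illustrative, not a recommendation):

```python
model_configuration = {
    "max_sample_size": 100_000,   # illustrative cap; omit to use all samples
    "max_training_time": 10,      # minutes, default from the table above
    "max_epochs": 100,            # default
    "max_sequence_window": 100,   # default, TABULAR models only
    "enable_flexible_generation": True,
    "value_protection": True,
}
```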
ModelEncodingType ¶
The encoding type used for model training and data generation.
- `AUTO`: Model chooses among available encoding types based on the column's data type.
- `TABULAR_CATEGORICAL`: Model samples from existing (non-rare) categories.
- `TABULAR_NUMERIC_AUTO`: Model chooses among 3 numeric encoding types based on the values.
- `TABULAR_NUMERIC_DISCRETE`: Model samples from existing discrete numerical values.
- `TABULAR_NUMERIC_BINNED`: Model samples from binned buckets, to then sample randomly within a bucket.
- `TABULAR_NUMERIC_DIGIT`: Model samples each digit of a numerical value.
- `TABULAR_CHARACTER`: Model samples each character of a string value.
- `TABULAR_DATETIME`: Model samples each part of a datetime value.
- `TABULAR_DATETIME_RELATIVE`: Model samples the relative difference between datetimes within a sequence.
- `TABULAR_LAT_LONG`: Model samples a latitude-longitude column. The format is "latitude,longitude".
- `LANGUAGE_TEXT`: Model will train a distinct LANGUAGE model for this column, to then generate free text.

Encoding types that are not prefixed with either `TABULAR` or `LANGUAGE` are deprecated.
ModelMetrics ¶
Metrics regarding the quality of synthetic data, measured in terms of accuracy, similarity, and distances.
- Accuracy: Metrics regarding the accuracy of synthetic data, measured as the closeness of discretized lower dimensional marginal distributions.
- Similarity: Metrics regarding the similarity of the full joint distributions of samples within an embedding space.
- Distances: Metrics regarding the nearest neighbor distances between training, holdout, and synthetic samples in an embedding space. Useful for assessing the novelty / privacy of synthetic data.
The quality of synthetic data is assessed by comparing these metrics to the same metrics of a holdout dataset. The holdout dataset is a subset of the original training data, that was not used for training the synthetic data generator. The metrics of the synthetic data should be as close as possible to the metrics of the holdout data.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `accuracy` | `Accuracy \| None` | | `None` |
| `distances` | `Distances \| None` | | `None` |
| `similarity` | `Similarity \| None` | | `None` |
ModelType ¶
The type of model.
- `TABULAR`: A generative AI model tailored towards tabular data, trained from scratch.
- `LANGUAGE`: A generative AI model built upon a (pre-trained) language model.
Notification ¶
A notification for a user.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `id` | `str` | The unique identifier of the notification. | required |
| `type` | `NotificationType` | | required |
| `message` | `str` | The message of the notification. | required |
| `status` | `NotificationStatus` | | required |
| `created_at` | `datetime` | The UTC date and time when the notification has been created. | required |
| `resource_uri` | `str \| None` | The API service endpoint of the entity. | `None` |
NotificationStatus ¶
The status of the notification.
NotificationType ¶
The type of the notification
PaginatedTotalCount ¶
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `root` | `int` | The total number of entities within the list. | required |
ParallelGenerationJobs ¶
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `current` | `int \| None` | The number of currently running generation jobs. | `None` |
| `limit` | `int \| None` | The maximum number of running generation jobs at any time. If empty, then there is no limit. | `None` |
|
ParallelTrainingJobs ¶
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `current` | `int \| None` | The number of currently running training jobs. | `None` |
| `limit` | `int \| None` | The maximum number of running training jobs at any time. If empty, then there is no limit. | `None` |
|
PermissionLevel ¶
The permission level of the user with respect to this resource.
- `VIEW`: The user can view and use the resource.
- `ADMIN`: The user can edit, delete, and transfer ownership of the resource.
Probe ¶
The generated synthetic samples returned as a result of the probe.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str \| None` | The name of the table. | `None` |
| `rows` | `list[dict[str, Any]] \| None` | | `None` |
|
ProgressStatus ¶
The status of a job or a step.
- `NEW`: The job/step is being configured, and has not started yet.
- `CONTINUE`: The job/step is being configured, but has existing artefacts.
- `ON_HOLD`: The job/step has been started, but is kept on hold.
- `QUEUED`: The job/step has been started, and is awaiting resources to execute.
- `IN_PROGRESS`: The job/step is currently running.
- `DONE`: The job/step has finished successfully.
- `FAILED`: The job/step has failed.
- `CANCELED`: The job/step has been canceled.
ProgressStep ¶
The progress of a step.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `id` | `str \| None` | | `None` |
| `model_label` | `str \| None` | The unique label for the model, consisting of table name and a suffix for the model type. This will be empty for steps that are not related to a model. | `None` |
| `compute_name` | `str \| None` | | `None` |
| `restarts` | `int \| None` | The number of previous restarts for the corresponding task. | `None` |
| `step_code` | `StepCode \| None` | | `None` |
| `start_date` | `datetime \| None` | The UTC date and time when the step has started. If the step has not started yet, then this is None. | `None` |
| `end_date` | `datetime \| None` | The UTC date and time when the step has ended. If the step is still running, then this is None. | `None` |
| `messages` | `list[dict[str, Any]] \| None` | | `None` |
| `error_message` | `str \| None` | | `None` |
| `progress` | `ProgressValue \| None` | | `None` |
| `status` | `ProgressStatus \| None` | | `None` |
|
ProgressValue ¶
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `value` | `int \| None` | | `None` |
| `max` | `int \| None` | | `None` |
|
RareCategoryReplacementMethod ¶
Specifies how rare categories will be sampled. Only applicable if value protection has been enabled.
- `CONSTANT`: Replace rare categories by a constant `_RARE_` token.
- `SAMPLE`: Replace rare categories by a sample from non-rare categories.
RebalancingConfig ¶
Configure rebalancing.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `column` | `str` | The name of the column to be rebalanced. Only applicable for a subject table. Only applicable for categorical columns. | required |
| `probabilities` | `dict[str, float]` | The target distribution of sample values. The keys are the categorical values, and the values are the probabilities. | required |
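For example, rebalancing a hypothetical categorical column to a uniform target distribution:

```python
rebalancing_config = {
    "column": "gender",  # hypothetical categorical column of the subject table
    "probabilities": {"female": 0.5, "male": 0.5},
}
# The target probabilities should not exceed a total of 1.
assert sum(rebalancing_config["probabilities"].values()) <= 1.0
```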
Similarity ¶
Metrics regarding the similarity of the full joint distributions of samples within an embedding space.
- Cosine Similarity: The cosine similarity between the centroids of synthetic and training samples.
- Discriminator AUC: The AUC of a discriminative model to distinguish between synthetic and training samples.
The SentenceTransformer model all-MiniLM-L6-v2 is used to compute the embeddings of a string-ified representation of individual records. In the case of sequential data, the records that belong to the same group are concatenated. We then calculate the cosine similarity between the centroids of the provided datasets within the embedding space.
Again, we expect the similarity metrics to be as close as possible to 1, but not significantly higher than what is measured for the holdout data, as this would again indicate overfitting.
In addition, a discriminative ML model is trained to distinguish between training and synthetic samples. The ability of this model to distinguish between training and synthetic samples is measured by the AUC score. For synthetic data to be considered realistic, the AUC score should be close to 0.5, which indicates that the synthetic data is indistinguishable from the training data.
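The centroid-based cosine similarity can be sketched in a few lines of plain Python (the platform computes it on SentenceTransformer embeddings; the vectors below are illustrative):

```python
def centroid(embeddings):
    # Component-wise mean of a list of embedding vectors.
    n = len(embeddings)
    return [sum(vec[i] for vec in embeddings) / n for i in range(len(embeddings[0]))]

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

training = [[1.0, 0.0], [0.0, 1.0]]
synthetic = [[0.9, 0.1], [0.1, 0.9]]
print(round(cosine_similarity(centroid(training), centroid(synthetic)), 3))  # 1.0
```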
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `cosine_similarity_training_synthetic` | `float \| None` | Cosine similarity between training and synthetic centroids. | `None` |
| `cosine_similarity_training_holdout` | `float \| None` | Cosine similarity between training and holdout centroids. Serves as a reference for `cosine_similarity_training_synthetic`. | `None` |
| `discriminator_auc_training_synthetic` | `float \| None` | Cross-validated AUC of a discriminative model to distinguish between training and synthetic samples. | `None` |
| `discriminator_auc_training_holdout` | `float \| None` | Cross-validated AUC of a discriminative model to distinguish between training and holdout samples. Serves as a reference for `discriminator_auc_training_synthetic`. | `None` |
|
SourceColumn ¶
A column as part of a source table.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `id` | `str` | The unique identifier of a source column. | required |
| `name` | `str` | The name of a source column. | required |
| `included` | `bool` | If true, the column will be included in the training. If false, the column will be excluded from the training. | required |
| `model_encoding_type` | `ModelEncodingType` | | required |
| `value_range` | `SourceColumnValueRange \| None` | | `None` |
|
SourceColumnConfig ¶
The configuration for a source column when creating a new generator.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The name of a source column. | required |
model_encoding_type | ModelEncodingType \| None | | <ModelEncodingType.auto: 'AUTO'> |
SourceColumnValueRange ¶
The (privacy-safe) range of values detected within a source column. These values can then be used as seed values for conditional generation. For CATEGORICAL and NUMERIC_DISCRETE encoding types, this is given as a list of unique values, sorted by popularity. For other NUMERIC and for DATETIME encoding types, this is given as a min and max value. Note that this property will only be populated once the analysis step of generator training has completed.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
min | str \| None | The minimum value of the column. For dates, this is represented in ISO format. | None |
max | str \| None | The maximum value of the column. For dates, this is represented in ISO format. | None |
values | list[str] \| None | The list of distinct values of the column, limited to a maximum of 1000 values. | None |
has_null | bool \| None | If true, a null value was detected within the column. | None |
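To illustrate how such a value range could be derived from raw data, here is a minimal pandas sketch. It simplifies the semantics above (all numeric columns are summarized as min/max, whereas the real schema reports NUMERIC_DISCRETE columns as a value list) and is not the service's actual analysis step.

```python
import pandas as pd

def detect_value_range(col: pd.Series, max_values: int = 1000) -> dict:
    """Derive a SourceColumnValueRange-like summary from a raw column."""
    has_null = bool(col.isna().any())
    non_null = col.dropna()
    if pd.api.types.is_datetime64_any_dtype(non_null):
        # dates are reported as ISO-formatted min/max
        return {"min": non_null.min().isoformat(), "max": non_null.max().isoformat(),
                "values": None, "has_null": has_null}
    if pd.api.types.is_numeric_dtype(non_null):
        return {"min": str(non_null.min()), "max": str(non_null.max()),
                "values": None, "has_null": has_null}
    # categorical: distinct values sorted by popularity, capped at max_values
    values = non_null.value_counts().index.tolist()[:max_values]
    return {"min": None, "max": None,
            "values": [str(v) for v in values], "has_null": has_null}
```

For example, a categorical column yields a popularity-sorted value list plus a null flag, while a numeric column yields stringified min/max bounds.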
SourceForeignKey ¶
Parameters:
Name | Type | Description | Default |
---|---|---|---|
id | str | The unique identifier of a foreign key. | required |
column | str \| None | The column name of a foreign key. | None |
referenced_table | str | The name of the referenced table. That table must already have a primary key defined. | required |
is_context | bool | If true, the foreign key is treated as a context relation. Note that only one foreign key relation per table can be a context relation. | required |
SourceForeignKeyConfig ¶
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column | str | The column name of a foreign key. | required |
referenced_table | str | The name of the referenced table. That table must already have a primary key defined. | required |
is_context | bool \| None | If true, the foreign key is treated as a context relation. Note that only one foreign key relation per table can be a context relation. | None |
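To make the context-relation constraint concrete, a two-table setup might be described by a configuration payload like the following. This is a hypothetical sketch that only mirrors the SourceTableConfig / SourceForeignKeyConfig fields above; the table and column names are invented, and no specific client call is shown.

```python
# Hypothetical payload mirroring SourceTableConfig and SourceForeignKeyConfig.
# "customers" and "orders" are invented example tables.
tables = [
    {
        "name": "customers",
        "primary_key": "id",
    },
    {
        "name": "orders",
        "primary_key": "id",
        "foreign_keys": [
            {
                "column": "customer_id",
                "referenced_table": "customers",  # referenced table must have a primary key
                "is_context": True,  # at most one context relation per table
            }
        ],
    },
]
```

Here the `orders` table has exactly one context relation, pointing at its parent `customers` table, as the schema requires.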
SourceTable ¶
A table as part of a generator.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
id | str | The unique identifier of a source table. | required |
source_connector | BaseResource \| None | | None |
location | str \| None | The location of a source table. Together with the source connector, it uniquely identifies a source to sample data from. | None |
name | str | The name of a source table. It must be unique within a generator. | required |
primary_key | str \| None | The column name of the primary key. | None |
columns | list[SourceColumn] | The columns of this generator table. | required |
foreign_keys | list[SourceForeignKey] \| None | The foreign keys of a table. | None |
model_metrics | ModelMetrics \| None | | None |
language_model_metrics | ModelMetrics \| None | | None |
model_configuration | ModelConfiguration \| None | | None |
language_model_configuration | ModelConfiguration \| None | | None |
total_rows | int \| None | The total number of rows in the source table when fetching data for training. | None |
SourceTableAddConfig ¶
The configuration for adding a new source table to a generator.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
source_connector_id | str | The unique identifier of a connector. | required |
location | str | The location of a source table. Together with the source connector, it uniquely identifies a source to sample data from. | required |
name | str \| None | The name of a source table. It must be unique within a generator. | None |
include_children | bool \| None | If true, all tables referenced by foreign keys will be included. If false, only the selected table will be included. | None |
model_configuration | ModelConfiguration \| None | | None |
language_model_configuration | ModelConfiguration \| None | | None |
SourceTableConfig ¶
The configuration for a source table when creating a new generator.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The name of a source table. It must be unique within a generator. | required |
source_connector_id | str \| None | The unique identifier of a connector. | None |
location | str \| None | The location of a source table. Together with the source connector, it uniquely identifies a source to sample data from. | None |
data | str \| None | The base64-encoded string derived from a Parquet file containing the specified source table. | None |
model_configuration | ModelConfiguration \| None | | None |
language_model_configuration | ModelConfiguration \| None | | None |
primary_key | str \| None | The column name of the primary key. | None |
foreign_keys | list[SourceForeignKeyConfig] \| None | The foreign key configurations of this table. | None |
columns | list[SourceColumnConfig] \| None | The column configurations of this table. | None |
SyntheticDataset ¶
A synthetic dataset is created based on a trained generator.
It consists of synthetic samples, as well as a quality assurance report.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
id | str | The unique identifier of a synthetic dataset. | required |
generator | BaseResource \| None | | None |
metadata | Metadata | | required |
name | str | The name of a synthetic dataset. | required |
description | str \| None | The description of a synthetic dataset. | None |
generation_status | ProgressStatus | | required |
generation_time | datetime \| None | The UTC date and time when the generation has finished. | None |
tables | list[SyntheticTable] \| None | The tables of this synthetic dataset. | None |
delivery | SyntheticDatasetDelivery \| None | | None |
accuracy | float \| None | The overall accuracy of the trained generator. This is the average of the overall accuracy scores of all trained models. | None |
usage | SyntheticDatasetUsage \| None | | None |
generation | Any \| None | | None |
Generation ¶
progress ¶
Retrieve the progress of the generation process.
Returns:
Name | Type | Description |
---|---|---|
JobProgress | JobProgress | The progress of the generation process. |
wait ¶
Poll the generation progress and wait until the process is complete.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
progress_bar | bool | If true, displays a progress bar. | True |
interval | float | The interval in seconds at which to poll the job progress. | 2 |
config ¶
Retrieve writable synthetic dataset properties.
Returns:
Name | Type | Description |
---|---|---|
SyntheticDatasetConfig | SyntheticDatasetConfig | The synthetic dataset properties as a configuration object. |
data ¶
Download the synthetic dataset and return it as a pandas DataFrame or as a dictionary of pandas DataFrames.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
return_type | Literal['auto', 'dict'] | The format of the returned data. | 'auto' |
Returns:
Type | Description |
---|---|
DataFrame \| dict[str, DataFrame] | The synthetic dataset, either as a single pandas DataFrame or as a dictionary of pandas DataFrames. |
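One plausible reading of the `return_type` switch is that `'auto'` unwraps a single-table dataset into one DataFrame, while `'dict'` always returns the full mapping. A sketch of that dispatch, as an assumption rather than the SDK's actual code:

```python
import pandas as pd

def as_return_type(tables: dict[str, pd.DataFrame], return_type: str = "auto"):
    """Return a lone table as a bare DataFrame under 'auto', else the full dict."""
    if return_type == "auto" and len(tables) == 1:
        return next(iter(tables.values()))
    return tables
```

Under this reading, callers who always want a uniform `dict[str, DataFrame]` shape would pass `return_type='dict'`.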
download ¶
Download synthetic dataset and save to file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
format | SyntheticDatasetFormat | The format of the synthetic dataset. | 'PARQUET' |
file_path | str \| Path \| None | The file path to save the synthetic dataset. | None |
Returns:
Type | Description |
---|---|
Path | The path to the saved file. |
update ¶
Update a synthetic dataset with specific parameters.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str \| None | The name of the synthetic dataset. | None |
description | str \| None | The description of the synthetic dataset. | None |
delivery | SyntheticDatasetDelivery \| None | The delivery configuration for the synthetic dataset. | None |
SyntheticDatasetConfig ¶
The configuration for creating a new synthetic dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
generator_id | str \| None | The unique identifier of a generator. | None |
name | str \| None | The name of a synthetic dataset. | None |
description | str \| None | The description of a synthetic dataset. | None |
tables | list[SyntheticTableConfig] \| None | | None |
delivery | SyntheticDatasetDelivery \| None | | None |
SyntheticDatasetDelivery ¶
Parameters:
Name | Type | Description | Default |
---|---|---|---|
overwrite_tables | bool | If true, tables in the destination will be overwritten. If false and any tables already exist, the delivery will fail. | required |
destination_connector_id | str | The unique identifier of a connector. | required |
location | str | The location for the destination connector. | required |
SyntheticDatasetListItem ¶
Essential synthetic dataset details for listings.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
id | str | The unique identifier of a synthetic dataset. | required |
metadata | Metadata | | required |
name | str | The name of a synthetic dataset. | required |
description | str \| None | The description of a synthetic dataset. | None |
generation_status | ProgressStatus | | required |
generation_time | datetime \| None | The UTC date and time when the generation has finished. | None |
usage | SyntheticDatasetUsage \| None | | None |
SyntheticDatasetUsage ¶
Usage statistics of a synthetic dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
total_datapoints | int \| None | The number of datapoints in the synthetic dataset. | None |
total_credits | float \| None | The number of credits used for the synthetic dataset. | None |
total_compute_time | int \| None | The total compute time in seconds used for generating this synthetic dataset. This is the sum of the compute time of all executed tasks. | None |
no_of_shares | int \| None | The number of shares of this synthetic dataset. | None |
no_of_likes | int \| None | The number of likes of this synthetic dataset. | None |
SyntheticProbeConfig ¶
The configuration for probing for new synthetic samples.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
generator_id | str \| None | The unique identifier of a generator. | None |
tables | list[SyntheticTableConfig] \| None | | None |
SyntheticTable ¶
A synthetic table that will be generated.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
id | str \| None | The unique identifier of a synthetic table. | None |
name | str | The name of a source table. It must be unique within a generator. | required |
configuration | SyntheticTableConfiguration \| None | | None |
model_metrics | ModelMetrics \| None | | None |
language_model_metrics | ModelMetrics \| None | | None |
foreign_keys | list[ForeignKey] \| None | The foreign keys of this table. | None |
total_rows | int \| None | The total number of rows for that table in the generated synthetic dataset. | None |
total_datapoints | int \| None | The total number of datapoints for that table in the generated synthetic dataset. | None |
source_table_total_rows | int \| None | The total number of rows in the source table while fetching data for training. | None |
SyntheticTableConfig ¶
The configuration for a synthetic table when creating a new synthetic dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The name of a synthetic table. This matches the name of a corresponding SourceTable. | required |
configuration | SyntheticTableConfiguration \| None | | None |
SyntheticTableConfiguration ¶
The sample configuration for a synthetic table.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sample_size | int \| None | Number of generated samples. Only applicable for subject tables. If neither sample size nor seed is provided, the default behavior for synthetic datasets is to generate as many samples as in the original data, and the default behavior for synthetic probes is to generate one subject only. | None |
sample_seed_connector_id | str \| None | The connector id of the seed data for conditional generation. Only applicable for subject tables. | None |
sample_seed_dict | str \| None | The base64-encoded string derived from a JSON Lines file containing the specified sample seed data. | None |
sample_seed_data | str \| None | The base64-encoded string derived from a Parquet file containing the specified sample seed data. | None |
sampling_temperature | float \| None | The temperature used for sampling. | None |
sampling_top_p | float \| None | The top-p value used for sampling. | None |
rebalancing | RebalancingConfig \| None | | None |
imputation | ImputationConfig \| None | | None |
fairness | FairnessConfig \| None | | None |
tabular_compute | str \| None | | None |
language_compute | str \| None | | None |
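Preparing a seed payload for the `sample_seed_dict` field can be sketched as below: serialize the seed DataFrame to JSON Lines, then base64-encode the bytes. The helper names are illustrative, not part of the SDK; `sample_seed_data` is analogous, but uses a Parquet file (which additionally requires a parquet engine such as pyarrow).

```python
import base64
import io

import pandas as pd

def encode_seed_jsonl(df: pd.DataFrame) -> str:
    """Serialize seed data to JSON Lines and base64-encode it (for sample_seed_dict)."""
    jsonl = df.to_json(orient="records", lines=True)
    return base64.b64encode(jsonl.encode("utf-8")).decode("ascii")

def decode_seed_jsonl(payload: str) -> pd.DataFrame:
    """Recover the seed DataFrame from the base64 payload (round-trip check)."""
    text = base64.b64decode(payload).decode("utf-8")
    return pd.read_json(io.StringIO(text), lines=True)
```

A seed like `pd.DataFrame({"age": [30, 40], "gender": ["F", "M"]})` round-trips through encode/decode unchanged, which is a quick way to validate a payload before submitting it.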
User ¶
A user of the service.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
id | str \| None | The unique identifier of a user. | None |
first_name | str \| None | The first name of a user. | None |
last_name | str \| None | The last name of a user. | None |
email | str \| None | The email address of a user. | None |
UserSettingsAssistantUpdateConfig ¶
Parameters:
Name | Type | Description | Default |
---|---|---|---|
about_user_message | str \| None | Instructions about the user that help the Assistant provide better responses. | None |
about_model_message | str \| None | Instructions on how the Assistant should respond. | None |
UserSettingsProfileUpdateConfig ¶
Parameters:
Name | Type | Description | Default |
---|---|---|---|
first_name | str \| None | The first name of a user. | None |
last_name | str \| None | The last name of a user. | None |
UserSettingsUpdateConfig ¶
The configuration for updating user settings.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
profile | UserSettingsProfileUpdateConfig \| None | | None |
assistant | UserSettingsAssistantUpdateConfig \| None | | None |
UserUsage ¶
Parameters:
Name | Type | Description | Default |
---|---|---|---|
credits | Credits \| None | | None |
parallel_training_jobs | ParallelTrainingJobs \| None | | None |
parallel_generation_jobs | ParallelGenerationJobs \| None | | None |