API Reference

QA Reference

mostlyai.qa.report

report(
    *,
    syn_tgt_data,
    trn_tgt_data,
    hol_tgt_data=None,
    syn_ctx_data=None,
    trn_ctx_data=None,
    hol_ctx_data=None,
    ctx_primary_key=None,
    tgt_context_key=None,
    report_path="model-report.html",
    report_title="Model Report",
    report_subtitle="",
    report_credits=REPORT_CREDITS,
    report_extra_info="",
    max_sample_size_accuracy=None,
    max_sample_size_embeddings=None,
    statistics_path=None,
    update_progress=None
)

Generate an HTML report and metrics for assessing synthetic data quality.

Compares synthetic data samples with original training samples in terms of accuracy, similarity and distances. Provide holdout samples to calculate reference values for similarity and distances (recommended).

If the synthetic data has been generated conditionally on a context dataset, provide the context data as well. This allows bivariate accuracy metrics between context and target to be calculated.

If the data is sequential, provide tgt_context_key to set the group-by column for the target data.

Customize the report with the report_title, report_subtitle, report_credits, and report_extra_info.

Limit compute time by setting max_sample_size_accuracy and max_sample_size_embeddings.

Parameters:

Name Type Description Default
syn_tgt_data DataFrame

The synthetic (target) data.

required
trn_tgt_data DataFrame

The training (target) data.

required
hol_tgt_data DataFrame | None

The holdout (target) data.

None
syn_ctx_data DataFrame | None

The synthetic context data.

None
trn_ctx_data DataFrame | None

The training context data.

None
hol_ctx_data DataFrame | None

The holdout context data.

None
ctx_primary_key str | None

The primary key of the context data.

None
tgt_context_key str | None

The context key of the target data.

None
report_path str | Path | None

The path to store the HTML report.

'model-report.html'
report_title str

The title of the report.

'Model Report'
report_subtitle str

The subtitle of the report.

''
report_credits str

The credits of the report.

REPORT_CREDITS
report_extra_info str

The extra information of the report.

''
max_sample_size_accuracy int | None

The maximum sample size for accuracy calculations.

None
max_sample_size_embeddings int | None

The maximum sample size for embedding calculations (similarity & distances).

None
statistics_path str | Path | None

The path where the statistics are stored for later use by report_from_statistics.

None
update_progress ProgressCallback | None

The progress callback.

None

Returns:

Type Description
Path

The path to the generated HTML report.

ModelMetrics | None

Metrics instance with accuracy, similarity, and distances metrics.
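As a rough illustration of how a call and its results might be handled, the sketch below wraps report in a helper and flattens the returned metrics. It assumes that the two documented return values arrive together as a tuple; run_report and summarize are hypothetical helper names, and the field lists are copied from the Metrics Reference on this page. Because every metric is documented as optional, the summary skips anything that is None.

```python
from types import SimpleNamespace

# Field names as listed in the Metrics Reference on this page.
FIELDS = {
    "accuracy": ("overall", "univariate", "bivariate", "coherence"),
    "similarity": (
        "cosine_similarity_training_synthetic",
        "cosine_similarity_training_holdout",
        "discriminator_auc_training_synthetic",
        "discriminator_auc_training_holdout",
    ),
    "distances": ("ims_training", "ims_holdout", "dcr_training", "dcr_holdout", "dcr_share"),
}


def run_report(syn_tgt_data, trn_tgt_data, hol_tgt_data=None):
    """Hypothetical wrapper around mostlyai.qa.report (all arguments are keyword-only)."""
    # Imported lazily so that this module loads even without the package installed.
    from mostlyai import qa

    # Assumption: the report path and the metrics are returned as a tuple.
    report_path, metrics = qa.report(
        syn_tgt_data=syn_tgt_data,
        trn_tgt_data=trn_tgt_data,
        hol_tgt_data=hol_tgt_data,  # recommended, enables the reference values
        report_path="model-report.html",
    )
    return report_path, metrics


def summarize(metrics):
    """Flatten a ModelMetrics-like object into a dict, skipping None values."""
    summary = {}
    if metrics is None:
        return summary
    for group, names in FIELDS.items():
        section = getattr(metrics, group, None)
        if section is None:
            continue
        for name in names:
            value = getattr(section, name, None)
            if value is not None:
                summary[f"{group}.{name}"] = value
    return summary


# Stand-in object with the documented attribute names (not real results).
example = SimpleNamespace(
    accuracy=SimpleNamespace(overall=0.95, univariate=0.97, bivariate=0.93, coherence=None),
    similarity=None,
    distances=None,
)
example_summary = summarize(example)
```

The stand-in object only mirrors the documented attribute names, so the helper can be tried without generating a report.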

mostlyai.qa.report_from_statistics

report_from_statistics(
    *,
    syn_tgt_data,
    syn_ctx_data=None,
    statistics_path=None,
    ctx_primary_key=None,
    tgt_context_key=None,
    report_path="data-report.html",
    report_title="Data Report",
    report_subtitle="",
    report_credits=REPORT_CREDITS,
    report_extra_info="",
    max_sample_size_accuracy=None,
    max_sample_size_embeddings=None,
    update_progress=None
)

Generate an HTML report based on previously generated statistics and newly provided synthetic data samples.

Parameters:

Name Type Description Default
syn_tgt_data DataFrame

The synthetic (target) data.

required
syn_ctx_data DataFrame | None

The synthetic context data.

None
statistics_path str | Path | None

The path from where to fetch the statistics files.

None
ctx_primary_key str | None

The primary key of the context data.

None
tgt_context_key str | None

The context key of the target data.

None
report_path str | Path | None

The path to store the HTML report.

'data-report.html'
report_title str

The title of the report.

'Data Report'
report_subtitle str

The subtitle of the report.

''
report_credits str

The credits of the report.

REPORT_CREDITS
report_extra_info str

The extra information of the report.

''
max_sample_size_accuracy int | None

The maximum sample size for accuracy calculations.

None
max_sample_size_embeddings int | None

The maximum sample size for embedding calculations (similarity & distances).

None
update_progress ProgressCallback | None

The progress callback.

None

Returns:

Type Description
Path

The path to the generated HTML report.

Metrics Reference

mostlyai.qa.metrics.ModelMetrics

Parameters:

Name Type Description Default
accuracy Accuracy | None

Metrics regarding the accuracy of synthetic data, measured as the closeness of discretized lower dimensional marginal distributions.

None
similarity Similarity | None

Metrics regarding the similarity of the full joint distributions of samples within an embedding space.

None
distances Distances | None

Metrics regarding the nearest neighbor distances between training, holdout, and synthetic samples in an embedding space. Useful for assessing the novelty / privacy of synthetic data.

None

mostlyai.qa.metrics.Accuracy

Metrics regarding the accuracy of synthetic data, measured as the closeness of discretized lower dimensional marginal distributions.

  1. Univariate Accuracy: The accuracy of the univariate distributions for all target columns.
  2. Bivariate Accuracy: The accuracy of all pair-wise distributions for target columns, as well as for target columns with respect to the context columns.
  3. Coherence Accuracy: The accuracy of the auto-correlation for all target columns.

Accuracy is defined as 100% - Total Variation Distance (TVD), where TVD is half the sum of the absolute differences of the relative frequencies of the corresponding distributions.
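The definition can be illustrated for a single, already discretized column. This is a minimal sketch in plain Python, not the library's implementation; tvd_accuracy and relative_frequencies are hypothetical names, both inputs are assumed to be non-empty, and the result is expressed as a fraction between 0 and 1 rather than as a percentage.

```python
from collections import Counter


def relative_frequencies(values):
    """Map each category to its share of the (non-empty) sample."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def tvd_accuracy(trn_values, syn_values):
    """Return 1 - TVD for two samples of one discretized column."""
    p = relative_frequencies(trn_values)
    q = relative_frequencies(syn_values)
    categories = set(p) | set(q)
    # TVD: half the sum of absolute differences of the relative frequencies.
    tvd = 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)
    return 1.0 - tvd


# Toy data: training is 50/30/20 across a, b, c; synthetic is 40/40/20.
trn = ["a"] * 5 + ["b"] * 3 + ["c"] * 2
syn = ["a"] * 4 + ["b"] * 4 + ["c"] * 2
accuracy = tvd_accuracy(trn, syn)  # TVD is 0.1, so accuracy is about 0.9
```

Identical distributions give an accuracy of 1.0, and distributions with no category in common give 0.0.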

These accuracies are calculated for all discretized univariate and bivariate distributions and, for sequential data, for all coherence distributions as well. Overall metrics are then calculated as the average across these accuracies.

All metrics can be compared against a theoretical maximum accuracy, which is calculated for a same-sized holdout. The accuracy metrics should be as close as possible to the theoretical maximum, but not significantly higher, as that would indicate overfitting.

Parameters:

Name Type Description Default
overall float | None

Overall accuracy of synthetic data, averaged across univariate, bivariate, and coherence.

None
univariate float | None

Average accuracy of discretized univariate distributions.

None
bivariate float | None

Average accuracy of discretized bivariate distributions.

None
coherence float | None

Average accuracy of discretized coherence distributions. Only applicable for sequential data.

None
overall_max float | None

Expected overall accuracy of a same-sized holdout. Serves as a reference for overall.

None
univariate_max float | None

Expected univariate accuracy of a same-sized holdout. Serves as a reference for univariate.

None
bivariate_max float | None

Expected bivariate accuracy of a same-sized holdout. Serves as a reference for bivariate.

None
coherence_max float | None

Expected coherence accuracy of a same-sized holdout. Serves as a reference for coherence.

None

mostlyai.qa.metrics.Similarity

Metrics regarding the similarity of the full joint distributions of samples within an embedding space.

  1. Cosine Similarity: The cosine similarity between the centroids of synthetic and training samples.
  2. Discriminator AUC: The AUC of a discriminative model to distinguish between synthetic and training samples.

The SentenceTransformer model all-MiniLM-L6-v2 is used to compute the embeddings of a stringified representation of individual records. For sequential data, records that belong to the same group are concatenated. We then calculate the cosine similarity between the centroids of the provided datasets within the embedding space.
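The centroid comparison itself can be sketched without the embedding model. The snippet below is an illustration under stated assumptions: the two-dimensional vectors are made-up stand-ins for real embeddings, and centroid and cosine_similarity are hypothetical helper names rather than library functions.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 2-d "embeddings"; real ones would come from the embedding model.
trn_embeddings = [[1.0, 0.0], [0.8, 0.2]]
syn_embeddings = [[0.9, 0.2], [0.7, 0.2]]
hol_embeddings = [[0.7, 0.3], [1.0, 0.0]]

sim_trn_syn = cosine_similarity(centroid(trn_embeddings), centroid(syn_embeddings))
sim_trn_hol = cosine_similarity(centroid(trn_embeddings), centroid(hol_embeddings))
```

Here sim_trn_hol plays the role of the holdout reference against which sim_trn_syn is read.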

As with accuracy, we expect the similarity metrics to be as close as possible to 1, but not significantly higher than what is measured for the holdout data, as this would indicate overfitting.

In addition, a discriminative ML model is trained to distinguish between training and synthetic samples. The ability of this model to distinguish between training and synthetic samples is measured by the AUC score. For synthetic data to be considered realistic, the AUC score should be close to 0.5, which indicates that the synthetic data is indistinguishable from the training data.
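To make the 0.5 target concrete, the sketch below computes an AUC directly from classifier scores using pairwise comparisons. It does not reproduce the discriminative model or the cross-validation, which this page does not specify; auc is a hypothetical helper, and the scores are made-up values standing in for a model's predicted probability that a sample is synthetic.

```python
def auc(trn_scores, syn_scores):
    """Probability that a synthetic sample scores higher than a training sample.

    Ties count as half a win, so indistinguishable scores give 0.5.
    """
    wins = 0.0
    for s in syn_scores:
        for t in trn_scores:
            if s > t:
                wins += 1.0
            elif s == t:
                wins += 0.5
    return wins / (len(syn_scores) * len(trn_scores))


# Made-up discriminator scores ("probability of being synthetic").
separable = auc([0.1, 0.2], [0.8, 0.9])          # model tells the sets apart
indistinguishable = auc([0.5, 0.5], [0.5, 0.5])  # model cannot tell them apart
partial = auc([0.2, 0.6], [0.4, 0.8])            # 3 of 4 pairs ranked correctly
```

A value near 1.0 means the discriminator separates the two sets easily, whereas a value near 0.5 means it does no better than guessing.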

Parameters:

Name Type Description Default
cosine_similarity_training_synthetic float | None

Cosine similarity between training and synthetic centroids.

None
cosine_similarity_training_holdout float | None

Cosine similarity between training and holdout centroids. Serves as a reference for cosine_similarity_training_synthetic.

None
discriminator_auc_training_synthetic float | None

Cross-validated AUC of a discriminative model to distinguish between training and synthetic samples.

None
discriminator_auc_training_holdout float | None

Cross-validated AUC of a discriminative model to distinguish between training and holdout samples. Serves as a reference for discriminator_auc_training_synthetic.

None

mostlyai.qa.metrics.Distances

Metrics regarding the nearest neighbor distances between training, holdout, and synthetic samples in an embedding space. Useful for assessing the novelty / privacy of synthetic data.

The provided data is first down-sampled so that the number of samples matches across all datasets. Note that, for optimal sensitivity of this privacy assessment, it is recommended to use a 50/50 split between training and holdout data, and then generate synthetic data of the same size.

The embeddings of these samples are then computed, and the L2 nearest neighbor distances are calculated for each synthetic sample to the training and holdout samples. Based on these nearest neighbor distances the following metrics are calculated:

  - Identical Match Share (IMS): The share of synthetic samples that are identical to a training or holdout sample.
  - Distance to Closest Record (DCR): The average distance of synthetic to training or holdout samples.

For privacy-safe synthetic data, we expect about as many identical matches and about the same distances between synthetic and training samples as between synthetic and holdout samples.
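The two metrics can be sketched with a brute-force nearest-neighbor search over toy vectors. This is an illustration only: nearest_distances and ims_and_dcr are hypothetical names, the vectors stand in for real embeddings, and treating a nearest-neighbor distance of exactly zero as an identical match is an assumption made for this sketch.

```python
import math


def nearest_distances(queries, references):
    """L2 distance from each query vector to its closest reference vector."""
    return [min(math.dist(q, r) for r in references) for q in queries]


def ims_and_dcr(syn, ref):
    """Return (identical match share, average distance to closest record)."""
    distances = nearest_distances(syn, ref)
    # Assumption: an identical match is a nearest-neighbor distance of exactly 0.
    ims = sum(1 for d in distances if d == 0.0) / len(distances)
    dcr = sum(distances) / len(distances)
    return ims, dcr


# Toy vectors of equal size per dataset, standing in for embeddings.
syn = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
trn = [[0.0, 0.0], [1.0, 2.0], [10.0, 10.0]]
hol = [[0.0, 1.0], [3.0, 0.0], [10.0, 0.0]]

ims_training, dcr_training = ims_and_dcr(syn, trn)
ims_holdout, dcr_holdout = ims_and_dcr(syn, hol)
```

In this toy data one synthetic sample coincides with a training sample and the average distance to training is smaller than to holdout, which is the pattern the comparison is meant to surface.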

Parameters:

Name Type Description Default
ims_training float | None

Share of synthetic samples that are identical to a training sample.

None
ims_holdout float | None

Share of synthetic samples that are identical to a holdout sample. Serves as a reference for ims_training.

None
dcr_training float | None

Average L2 nearest-neighbor distance between synthetic and training samples.

None
dcr_holdout float | None

Average L2 nearest-neighbor distance between synthetic and holdout samples. Serves as a reference for dcr_training.

None
dcr_share float | None

Share of synthetic samples that are closer to a training sample than to a holdout sample. This should not be significantly larger than 50%.

None
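The dcr_share metric can be sketched in the same brute-force style. The helper name and the one-dimensional toy vectors are hypothetical, and counting a tie as "not closer to training" is an assumption of this sketch rather than documented behavior.

```python
import math


def dcr_share(syn, trn, hol):
    """Share of synthetic samples strictly closer to training than to holdout."""
    closer_to_training = 0
    for s in syn:
        d_trn = min(math.dist(s, t) for t in trn)
        d_hol = min(math.dist(s, h) for h in hol)
        if d_trn < d_hol:  # assumption: ties do not count towards the share
            closer_to_training += 1
    return closer_to_training / len(syn)


# One-dimensional toy vectors, four samples per dataset.
syn = [[0.0], [1.0], [2.0], [3.0]]
trn = [[0.1], [0.9], [5.0], [6.0]]
hol = [[2.1], [3.1], [7.0], [8.0]]

balanced_share = dcr_share(syn, trn, hol)  # two of four are closer to training
copied_share = dcr_share(trn, trn, hol)    # synthetic data that copies training
```

A share around 50% is what the comparison above describes as expected, while synthetic data that simply copies training records pushes the share towards 100%.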