Validation Framework & Benchmarks

Statistical validation methodology for comparing live vs synthetic survey data distributions.

Six completed validation reports are available for download, covering consumer, healthcare, and social research domains. Each report includes full distribution tables, metric summaries, and subgroup analyses. Browse all papers and reports →

Validation Framework

Our validation methodology uses a question type-specific approach. Each question type (single-choice, multi-choice, numeric, ranking, text) is evaluated using metrics designed for its data structure. This prevents the use of inappropriate statistical tests and ensures that passing thresholds reflect meaningful agreement.
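The type-specific approach can be pictured as a simple lookup from question type to metric set. A minimal sketch, matching the pairings in the benchmark table below; all names here are illustrative, not a production API:

```python
# Illustrative mapping of question types to the metrics used to evaluate them.
# The names are hypothetical; they mirror the benchmark table, not real code.
METRICS_BY_TYPE = {
    "single_choice": ["kl_divergence", "js_divergence"],
    "multi_choice": ["js_divergence", "spearman", "top_k_overlap"],
    "numeric_binned": ["kl_divergence", "js_divergence"],
    "percent_allocation": ["kl_divergence", "js_divergence", "top_k_overlap"],
    "ranking": ["spearman", "top_k_overlap"],
    "text": ["bertscore_f1", "optimal_matching"],
}

def metrics_for(question_type: str) -> list[str]:
    """Return the metrics appropriate for a question type, or fail loudly."""
    try:
        return METRICS_BY_TYPE[question_type]
    except KeyError:
        raise ValueError(f"Unsupported question type: {question_type}")
```

Dispatching on question type up front is what prevents, for example, a rank correlation from ever being applied to single-choice percentages.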

Every validation study follows a paired-comparison design: one live panel dataset and one synthetic dataset generated from the same survey instrument under identical conditions. The synthetic data is generated without access to live results.

Current Performance: 80-90% of tested questions across studies meet or exceed published benchmarks. Performance varies by question type, domain complexity, and topic sensitivity. All results are reported, including failures.

QA & Outlier Detection

Before any comparison metrics are computed, both live and synthetic datasets undergo quality assurance screening. Outlier detection, straightlining checks, and response consistency validation ensure that comparisons reflect data quality, not noise.

Our QA pipeline applies the same checks to both live and synthetic data, so neither dataset receives preferential treatment. For full details on our quality assurance methodology, see the Quality Assurance documentation.
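One of the checks mentioned above, straightlining, can be sketched as a simple per-respondent rule. This is an illustrative stand-in, not our production QA code; the 90% cutoff is an assumed parameter:

```python
def is_straightliner(responses: list[int], max_identical_frac: float = 0.9) -> bool:
    """Flag a respondent who gives (nearly) the same answer to every grid item.

    `responses` holds one numeric answer per grid row. A respondent whose most
    common answer covers more than `max_identical_frac` of rows is flagged.
    The threshold here is a hypothetical default, not a documented value.
    """
    if not responses:
        return False
    most_common = max(responses.count(v) for v in set(responses))
    return most_common / len(responses) > max_identical_frac
```

Because the same rule runs on live and synthetic respondents alike, a flagged row is excluded from both sides before any divergence metric is computed.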

Understanding Our Metrics with Examples

The following examples illustrate how each metric works in practice, using simplified live vs synthetic comparisons.

KL-Divergence

KL-Divergence measures how one probability distribution diverges from another. A value of 0 means the distributions are identical. Lower values indicate better alignment.

Live distribution
Preferred grocery factor

Price: 42% | Proximity: 28% | Quality: 18% | Selection: 12%

Synthetic distribution
Preferred grocery factor

Price: 40% | Proximity: 30% | Quality: 17% | Selection: 13%

In this example, KL-Divergence = 0.003, well below the 0.10 threshold. The synthetic distribution closely mirrors the live data with only 1-2 percentage point differences per option.
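The figure above can be reproduced directly from the two distributions. A minimal sketch, assuming base-2 logarithms (which yields the reported 0.003):

```python
import math

def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(P || Q) in bits (base-2 log), over aligned probability lists."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Price, Proximity, Quality, Selection
live = [0.42, 0.28, 0.18, 0.12]
synthetic = [0.40, 0.30, 0.17, 0.13]

print(round(kl_divergence(live, synthetic), 3))  # → 0.003
```

Note that KL is asymmetric: swapping the arguments gives a slightly different value, so the direction of comparison must be fixed in advance.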

Spearman Rank Correlation

Spearman correlation measures whether the rank ordering of options is preserved between live and synthetic data. A value of 1.0 means perfect rank agreement.

Live ranking
Treatment preference order

1. Drug A | 2. Drug C | 3. Drug B | 4. Drug D

Synthetic ranking
Treatment preference order

1. Drug A | 2. Drug C | 3. Drug D | 4. Drug B

Spearman = 0.80. The top two positions match exactly. Positions 3 and 4 are swapped, but the overall ordering is strongly correlated and exceeds the 0.75 threshold.
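The 0.80 figure follows from the classic d-squared formula. A minimal sketch, assuming no tied ranks:

```python
def spearman(rank_a: dict, rank_b: dict) -> float:
    """Spearman rank correlation via 1 - 6*sum(d^2)/(n(n^2-1)); assumes no ties."""
    n = len(rank_a)
    d2 = sum((rank_a[item] - rank_b[item]) ** 2 for item in rank_a)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

live = {"Drug A": 1, "Drug C": 2, "Drug B": 3, "Drug D": 4}
synthetic = {"Drug A": 1, "Drug C": 2, "Drug D": 3, "Drug B": 4}

print(spearman(live, synthetic))  # → 0.8
```

The two swapped items each contribute d² = 1, so the sum is 2 and the correlation is 1 − 12/60 = 0.80.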

Top-K Overlap

Top-K measures the proportion of top-ranked items that appear in both the live and synthetic top-K sets. It focuses on whether the model gets the most important options right.

Live top-3
Most-selected features

Battery life | Camera quality | Price

Synthetic top-3
Most-selected features

Battery life | Price | Camera quality

Top-3 overlap = 1.0 (3/3 items match). The same three features appear in both top-3 sets, even though the internal ordering differs slightly. This exceeds the 0.8 threshold.
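Because Top-K compares sets rather than orderings, the computation ignores position within the top K. A minimal sketch:

```python
def top_k_overlap(live_top: list[str], synth_top: list[str]) -> float:
    """Fraction of the live top-K items that also appear in the synthetic top-K."""
    return len(set(live_top) & set(synth_top)) / len(live_top)

live = ["Battery life", "Camera quality", "Price"]
synthetic = ["Battery life", "Price", "Camera quality"]

print(top_k_overlap(live, synthetic))  # → 1.0
```

This is why the example scores 1.0 despite the order differing: set membership, not rank, is what the metric rewards.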

BERTScore

BERTScore measures semantic similarity between text responses using contextual embeddings. Unlike exact-match comparison, it captures whether responses convey the same meaning even when worded differently.

Live response
Why did you choose this brand?

"I've been using it for years and it works reliably every time."

Synthetic response
Why did you choose this brand?

"It's a brand I trust because of its consistent performance over time."

BERTScore F1 = 0.82. The responses express the same core themes (long-term use, reliability, trust) using different wording. This exceeds the 0.75 threshold.
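Real BERTScore requires contextual embeddings from a pretrained model, so the sketch below shows only the core greedy-matching F1 step, run on made-up 2-d token vectors standing in for BERT embeddings. It illustrates the mechanics, not the 0.82 figure above:

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def bertscore_f1(cand_vecs: list, ref_vecs: list) -> float:
    """Greedy-matching F1 over token similarities (BERTScore's core step).

    Recall: each reference token matched to its most similar candidate token.
    Precision: each candidate token matched to its most similar reference token.
    """
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    return 2 * precision * recall / (precision + recall)

# Toy stand-ins for contextual embeddings of two short responses.
ref = [[1.0, 0.0], [0.7, 0.7]]
cand = [[0.9, 0.1], [0.6, 0.8]]
score = bertscore_f1(cand, ref)
```

In practice the token vectors come from a transformer model, which is what lets differently worded but semantically equivalent responses score highly.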

Performance Benchmarks

The table below summarizes the metrics, thresholds, achieved benchmarks, and practical notes for each question type.

| Question Type | Metrics | Threshold | Achieved Benchmark | Notes |
| --- | --- | --- | --- | --- |
| Single-choice | KL-Divergence, JS-Divergence | KL < 0.10, JS < 0.05 | KL ~0.03-0.08 | Likert collapsing permitted; strongest performance category |
| Multi-choice | JS-Divergence, Spearman, Top-K | JS < 0.05, Spearman > 0.75, Top-K > 0.8 | JS ~0.02-0.04, Spearman ~0.80-0.92 | Per-option binaries; performance scales with number of options |
| Numeric (binned) | KL-Divergence, JS-Divergence | KL < 0.10, JS < 0.05 | KL ~0.04-0.09 | Bins pre-declared; accuracy depends on bin granularity |
| Percent-allocation | KL-Divergence, JS-Divergence, Top-K | KL < 0.10, JS < 0.05, Top-K > 0.8 | KL ~0.05-0.09, Top-K ~0.85 | Dominant allocations tracked; more options increase difficulty |
| Ranking | Spearman, Top-K | Spearman > 0.75, Top-K > 0.8 | Spearman ~0.78-0.90 | Metrics scale with list length; top positions most accurate |
| Text responses | BERTScore F1, Optimal Matching | F1 > 0.75, OMS > 0.75 | F1 ~0.78-0.85 | Semantic similarity; factual questions outperform emotional ones |
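JS-Divergence, which appears alongside KL throughout the table, is its symmetric, bounded counterpart: each distribution is compared against their average. A minimal sketch, again assuming base-2 logarithms and reusing the grocery-question figures:

```python
import math

def kl(p: list[float], q: list[float]) -> float:
    """KL(P || Q) in bits over aligned probability lists."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence (base-2): symmetric and bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # midpoint distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

live = [0.42, 0.28, 0.18, 0.12]
synthetic = [0.40, 0.30, 0.17, 0.13]
print(js_divergence(live, synthetic))
```

Symmetry is why JS suits side-by-side reporting: neither dataset has to be designated the reference, and the value is unchanged if the arguments are swapped.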

How to Read Our Reports

Each validation report follows a standardized structure. Here is what to look for in each section.

  • Distribution Tables: Side-by-side percentage distributions for live and synthetic data on every question. Look for percentage point differences and the direction of any bias.
  • Sample Sizes: Live (N) and synthetic (N) reported per question and per subgroup. Subgroup analyses with N < 50 are flagged as directional only.
  • Metric Summary: A pass/fail table showing each metric, its computed value, the threshold, and the result. Failed metrics are highlighted with likely cause analysis.
  • Confidence Intervals: 95% bootstrap confidence intervals on all key metrics. Overlapping CIs between live and synthetic distributions indicate non-significant differences.
  • Subgroup Analysis: Key demographic subgroups (age, gender, income, region) analyzed separately. Subgroup performance may vary from overall results, and these differences are documented.
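The bootstrap confidence intervals mentioned above can be sketched for the simplest case, the difference in the share selecting one option. This is an illustrative stand-in for the reports' methodology, with resample count and seed chosen arbitrarily:

```python
import random

def bootstrap_ci(live: list[int], synth: list[int],
                 n_boot: int = 2000, seed: int = 0) -> tuple:
    """95% percentile bootstrap CI on the difference in proportions.

    `live` and `synth` are 0/1 indicators per respondent for one answer
    option. Resamples each dataset with replacement n_boot times and takes
    the 2.5th and 97.5th percentiles of the synthetic-minus-live difference.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        l = [rng.choice(live) for _ in live]
        s = [rng.choice(synth) for _ in synth]
        diffs.append(sum(s) / len(s) - sum(l) / len(l))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]

# 42% vs 40% choosing "Price", N = 100 each (hypothetical figures).
lo, hi = bootstrap_ci([1] * 42 + [0] * 58, [1] * 40 + [0] * 60)
```

With samples this small the interval comfortably spans zero, matching the report convention that overlapping live/synthetic intervals are read as non-significant differences.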

Limitations & Transparency

We document both strengths and limitations in every validation study. Understanding where synthetic data works well and where it falls short is essential for responsible application.

  • Topic sensitivity: Questions involving trauma, stigma, or strong social desirability bias show reduced accuracy.
  • Rare populations: Very small demographic segments (<2% of population) may not be well-represented in training data.
  • Temporal context: Models reflect their training data period. Rapid attitude shifts may not be captured until recalibration.
  • Geographic scope: Current validation is primarily U.S.-based. International applications require separate validation.
  • Interaction effects: Complex multi-way demographic interactions may be attenuated in synthetic data.

Transparency commitment: We publish all validation results, including failures. Every report includes a limitations section with specific guidance on where synthetic data should and should not be used for that study's domain and topic area.

Run your own validation.

Generate synthetic data from your survey and compare against your live benchmarks.