Six completed validation reports are available for download, covering consumer, healthcare, and social research domains. Each report includes full distribution tables, metric summaries, and subgroup analyses.
Validation Framework
Our validation methodology uses a question-type-specific approach. Each question type (single-choice, multi-choice, numeric, ranking, text) is evaluated with metrics designed for its data structure. This prevents the use of inappropriate statistical tests and ensures that passing thresholds reflect meaningful agreement.
Every validation study follows a paired-comparison design: one live panel dataset and one synthetic dataset generated from the same survey instrument under identical conditions. The synthetic data is generated without access to live results.
Current Performance: 80-90% of tested questions across studies meet or exceed published benchmarks. Performance varies by question type, domain complexity, and topic sensitivity. All results are reported, including failures.
QA & Outlier Detection
Before any comparison metrics are computed, both live and synthetic datasets undergo quality assurance screening. Outlier detection, straightlining checks, and response consistency validation ensure that comparisons reflect data quality, not noise.
Our QA pipeline applies the same checks to both live and synthetic data, so neither dataset receives preferential treatment. For full details on our quality assurance methodology, see the Quality Assurance documentation.
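As an illustration, a straightlining screen of the kind described above might look like the following sketch. The function names, the five-item minimum, and the example data are hypothetical, not our production pipeline; the point is that the identical screen runs on both datasets.

```python
# Hypothetical sketch of a straightlining check applied identically to
# live and synthetic respondents. Thresholds here are illustrative.

def is_straightliner(grid_responses, min_items=5):
    """Flag a respondent who gives the identical answer across a grid
    of at least `min_items` rating items."""
    if len(grid_responses) < min_items:
        return False
    return len(set(grid_responses)) == 1

def screen_panel(panel):
    """Drop flagged respondents; the same screen runs on both datasets."""
    return [r for r in panel if not is_straightliner(r)]

# Example: three respondents answering a five-item rating grid.
live = [[4, 4, 4, 4, 4], [2, 5, 3, 4, 1], [3, 3, 2, 4, 3]]
clean = screen_panel(live)   # the first respondent is flagged and removed
```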
Understanding Our Metrics with Examples
The following examples illustrate how each metric works in practice, using simplified live vs synthetic comparisons.
KL-Divergence
KL-Divergence measures how one probability distribution diverges from another; note that it is asymmetric, so KL(P||Q) and KL(Q||P) generally differ. A value of 0 means the distributions are identical. Lower values indicate better alignment.
Live: Price: 42% | Proximity: 28% | Quality: 18% | Selection: 12%
Synthetic: Price: 40% | Proximity: 30% | Quality: 17% | Selection: 13%
In this example, KL-Divergence = 0.003, well below the 0.10 threshold. The synthetic distribution closely mirrors the live data with only 1-2 percentage point differences per option.
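The calculation behind this example takes only a few lines; a minimal sketch, assuming the divergence is computed in log base 2 (which reproduces the reported 0.003):

```python
import math

def kl_divergence(p, q, base=2.0):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i); 0 when P and Q match."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

live      = [0.42, 0.28, 0.18, 0.12]  # Price, Proximity, Quality, Selection
synthetic = [0.40, 0.30, 0.17, 0.13]

kl = kl_divergence(live, synthetic)   # ~0.003, well below the 0.10 threshold
```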
Spearman Rank Correlation
Spearman correlation measures whether the rank ordering of options is preserved between live and synthetic data. A value of 1.0 means perfect rank agreement.
Live: 1. Drug A | 2. Drug C | 3. Drug B | 4. Drug D
Synthetic: 1. Drug A | 2. Drug C | 3. Drug D | 4. Drug B
Spearman = 0.80. The top two positions match exactly. Positions 3 and 4 are swapped, but the overall ordering is strongly correlated and exceeds the 0.75 threshold.
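The classic rank-difference formula reproduces this value; a minimal sketch (assuming no tied ranks):

```python
def spearman_rho(rank_a, rank_b):
    """Spearman correlation from two rank lists (no ties):
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Ranks of Drugs A, C, B, D in each dataset: positions 3 and 4 are swapped.
live_ranks      = [1, 2, 3, 4]
synthetic_ranks = [1, 2, 4, 3]

rho = spearman_rho(live_ranks, synthetic_ranks)   # 0.80, above 0.75
```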
Top-K Overlap
Top-K measures the proportion of top-ranked items that appear in both the live and synthetic top-K sets. It focuses on whether the model gets the most important options right.
Live top-3: Battery life | Camera quality | Price
Synthetic top-3: Battery life | Price | Camera quality
Top-3 overlap = 1.0 (3/3 items match). The same three features appear in both top-3 sets, even though the internal ordering differs slightly. This exceeds the 0.8 threshold.
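Top-K overlap reduces to a set intersection; a minimal sketch:

```python
def top_k_overlap(live_top, synthetic_top):
    """Fraction of the top-K items shared between live and synthetic
    data; ordering within the top K is ignored."""
    k = len(live_top)
    return len(set(live_top) & set(synthetic_top)) / k

live_top3      = ["Battery life", "Camera quality", "Price"]
synthetic_top3 = ["Battery life", "Price", "Camera quality"]

overlap = top_k_overlap(live_top3, synthetic_top3)   # 1.0 (3/3 match)
```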
BERTScore
BERTScore measures semantic similarity between text responses using contextual embeddings. Unlike exact-match comparison, it captures whether responses convey the same meaning even when worded differently.
Live: "I've been using it for years and it works reliably every time."
Synthetic: "It's a brand I trust because of its consistent performance over time."
BERTScore F1 = 0.82. The responses express the same core themes (long-term use, reliability, trust) using different wording. This exceeds the 0.75 threshold.
Performance Benchmarks
The table below summarizes the metrics, thresholds, achieved benchmarks, and practical notes for each question type.
| Question Type | Metrics | Threshold | Achieved Benchmark | Notes |
|---|---|---|---|---|
| Single-choice | KL-Divergence, JS-Divergence | KL < 0.10, JS < 0.05 | KL ~0.03-0.08 | Likert collapsing permitted; strongest performance category |
| Multi-choice | JS-Divergence, Spearman, Top-K | JS < 0.05, Spearman > 0.75, Top-K > 0.8 | JS ~0.02-0.04, Spearman ~0.80-0.92 | Per-option binaries; performance scales with number of options |
| Numeric (binned) | KL-Divergence, JS-Divergence | KL < 0.10, JS < 0.05 | KL ~0.04-0.09 | Bins pre-declared; accuracy depends on bin granularity |
| Percent-allocation | KL-Divergence, JS-Divergence, Top-K | KL < 0.10, JS < 0.05, Top-K > 0.8 | KL ~0.05-0.09, Top-K ~0.85 | Dominant allocations tracked; more options increase difficulty |
| Ranking | Spearman, Top-K | Spearman > 0.75, Top-K > 0.8 | Spearman ~0.78-0.90 | Metrics scale with list length; top positions most accurate |
| Text responses | BERTScore F1, Optimal Matching | F1 > 0.75, OMS > 0.75 | F1 ~0.78-0.85 | Semantic similarity; factual questions outperform emotional ones |
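JS-Divergence, which appears alongside KL throughout the table, is its symmetrized, bounded variant; a minimal sketch reusing the single-choice example distributions (log base 2 assumed, as with KL above):

```python
import math

def kl(p, q, base=2.0):
    """KL(P || Q) in the given log base; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q, base=2.0):
    """JS(P, Q) = 0.5 * KL(P||M) + 0.5 * KL(Q||M), where M = (P + Q) / 2.
    Unlike KL, it is symmetric and bounded (0..1 in base 2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m, base) + 0.5 * kl(q, m, base)

live      = [0.42, 0.28, 0.18, 0.12]
synthetic = [0.40, 0.30, 0.17, 0.13]

js = js_divergence(live, synthetic)   # well below the 0.05 threshold
```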
How to Read Our Reports
Each validation report follows a standardized structure. Here is what to look for in each section.
- Distribution Tables: Side-by-side percentage distributions for live and synthetic data on every question. Look for percentage point differences and the direction of any bias.
- Sample Sizes: Live and synthetic sample sizes (N) are reported per question and per subgroup. Subgroup analyses with N < 50 are flagged as directional only.
- Metric Summary: A pass/fail table showing each metric, its computed value, the threshold, and the result. Failed metrics are highlighted with likely cause analysis.
- Confidence Intervals: 95% bootstrap confidence intervals on all key metrics. Non-overlapping CIs indicate a significant live-synthetic difference; overlapping CIs are a conservative sign that the difference is not significant.
- Subgroup Analysis: Key demographic subgroups (age, gender, income, region) analyzed separately. Subgroup performance may vary from overall results, and these differences are documented.
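The percentile bootstrap behind the confidence intervals described above can be sketched as follows; the resample count, seed, and example data are illustrative, not taken from any report.

```python
import random

def bootstrap_ci(sample, stat, n_boot=2000, alpha=0.05, seed=7):
    """Percentile bootstrap CI: resample with replacement, recompute the
    statistic each time, and take the alpha/2 and 1-alpha/2 quantiles."""
    rng = random.Random(seed)
    n = len(sample)
    stats = sorted(
        stat([rng.choice(sample) for _ in range(n)]) for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Illustrative: 95% CI for the share choosing "Price" from 100 binary flags.
responses = [1] * 42 + [0] * 58          # 42% chose "Price"
mean = lambda xs: sum(xs) / len(xs)
low, high = bootstrap_ci(responses, mean)
```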
Limitations & Transparency
We document both strengths and limitations in every validation study. Understanding where synthetic data works well and where it falls short is essential for responsible application.
- Topic sensitivity: Questions involving trauma, stigma, or strong social desirability bias show reduced accuracy.
- Rare populations: Very small demographic segments (<2% of population) may not be well-represented in training data.
- Temporal context: Models reflect their training data period. Rapid attitude shifts may not be captured until recalibration.
- Geographic scope: Current validation is primarily U.S.-based. International applications require separate validation.
- Interaction effects: Complex multi-way demographic interactions may be attenuated in synthetic data.
Transparency commitment: We publish all validation results, including failures. Every report includes a limitations section with specific guidance on where synthetic data should and should not be used for that study's domain and topic area.