Validation Framework
Each validation study compares live survey data distributions with synthetic survey data distributions using established statistical divergence measures and correlation tests. Our approach is question-type-specific, recognizing that different response formats require different validation metrics.
Current Performance Results: Our synthetic data models currently achieve the benchmarks listed below for approximately 80-90% of tested questions across multiple validation studies. Performance is strongest on single-choice and ranking formats, with comparatively lower alignment on open-text responses and highly sensitive topics.
These results demonstrate that our current models are delivering research-grade synthetic data that meets rigorous statistical standards. We are committed to transparent reporting of both successes and limitations, including confidence interval analysis and subgroup performance, to help researchers make informed decisions about where synthetic data is applicable.
Quality Assurance & Outlier Detection
Beyond statistical validation, we've implemented AI-powered quality assurance to identify potential outlier questions that may not meet expected performance standards, even when projects lack comparative live data.
Automated Outlier Detection: Our system analyzes response patterns using distribution analysis, mutual information calculations, and semantic measurements to flag questions that deviate significantly from expected patterns. This proactive approach helps identify potential data quality issues before final delivery.
Detection Methods
When outlier questions are detected, they are flagged for manual review and may be regenerated using enhanced parameters or excluded from final datasets pending client consultation. This quality assurance layer provides an additional safeguard for projects without live comparison data.
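As a rough illustration of the distribution and mutual-information components of this check (the semantic measurements are omitted, and every name below is a hypothetical sketch rather than a description of our production detector), a flagging routine might look like this:

```python
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score

def flag_outlier_questions(responses_by_question, demographic, z_threshold=2.5):
    """Flag questions whose response behaviour deviates sharply from the survey norm.

    `responses_by_question` maps question id -> array of encoded answer indices
    (one per respondent); `demographic` holds an encoded demographic label for
    each of the same respondents. Illustrative sketch only.
    """
    features = {}
    for qid, answers in responses_by_question.items():
        spread = entropy(np.bincount(answers))        # how spread out the answers are
        mi = mutual_info_score(demographic, answers)  # how strongly answers track the demographic
        features[qid] = (spread, mi)

    matrix = np.array(list(features.values()))
    z = (matrix - matrix.mean(axis=0)) / (matrix.std(axis=0) + 1e-9)  # per-feature z-scores
    worst = np.abs(z).max(axis=1)                     # most extreme deviation per question
    return [qid for qid, w in zip(features, worst) if w > z_threshold]
```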
For detailed technical information about our outlier detection methodology, see our Quality Assurance page.
Understanding Our Metrics with Examples
We use several statistical measures to quantify how closely synthetic data matches live survey distributions. Here are real-world examples showing what different performance levels look like:
KL-Divergence
Measures how much one probability distribution differs from another. Values closer to 0 indicate better alignment.
"Which smartphone brand do you prefer most?"
Apple: 35%
Samsung: 25%
Google: 20%
OnePlus: 15%
Other: 5%
Apple: 32%
Samsung: 28%
Google: 18%
OnePlus: 16%
Other: 6%
Apple: 30%
Samsung: 30%
Google: 22%
OnePlus: 13%
Other: 5%
Apple: 42%
Samsung: 18%
Google: 22%
OnePlus: 12%
Other: 6%
Apple: 50%
Samsung: 15%
Google: 10%
OnePlus: 20%
Other: 5%
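For readers who want to reproduce figures like these, here is a minimal sketch of the divergence calculation (assuming SciPy; the function name and the use of natural logarithms are our illustration, not a description of the production pipeline):

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q) in nats

def kl_js_divergence(live, synthetic):
    """Return (KL, JS) divergence between two response distributions.

    `live` and `synthetic` are percentage breakdowns over the same
    answer options, in the same order.
    """
    p = np.asarray(live, dtype=float)
    q = np.asarray(synthetic, dtype=float)
    p, q = p / p.sum(), q / q.sum()      # normalise percentages to probabilities
    m = 0.5 * (p + q)                    # mixture distribution for Jensen-Shannon
    kl = entropy(p, q)                   # KL(live || synthetic)
    js = 0.5 * entropy(p, m) + 0.5 * entropy(q, m)
    return kl, js

# Smartphone-brand example from the table above (live vs. synthetic example A)
kl, js = kl_js_divergence([35, 25, 20, 15, 5], [32, 28, 18, 16, 6])
print(f"KL = {kl:.3f}, JS = {js:.3f}")   # KL ≈ 0.005, comfortably under the 0.10 benchmark
```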
Spearman Correlation
Measures how well rank order relationships are preserved between live and synthetic data.
"Rank these smartphone features by importance to you"
1. Battery Life
2. Camera Quality
3. Screen Size
4. Storage
5. Brand Name
1. Battery Life
2. Camera Quality
3. Storage
4. Screen Size
5. Brand Name
1. Camera Quality
2. Battery Life
3. Screen Size
4. Storage
5. Brand Name
1. Battery Life
2. Screen Size
3. Camera Quality
4. Storage
5. Brand Name
1. Screen Size
2. Brand Name
3. Battery Life
4. Camera Quality
5. Storage
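As an illustration of how such a value can be computed (a sketch, not our production code), SciPy's `spearmanr` can be applied directly to the rank vectors:

```python
from scipy.stats import spearmanr

# Ranks assigned to each feature (1 = most important), in a fixed feature order:
# [Battery Life, Camera Quality, Screen Size, Storage, Brand Name]
live_ranks      = [1, 2, 3, 4, 5]
synthetic_ranks = [1, 2, 4, 3, 5]   # synthetic example A: Screen Size and Storage swapped

rho, p_value = spearmanr(live_ranks, synthetic_ranks)
print(f"Spearman rho = {rho:.2f}")  # 0.90, above the > 0.75 benchmark for ranking questions
```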
Top-K Accuracy
Percentage of top response choices that match between live and synthetic data.
"Which news sources do you use regularly? (Select all that apply)"
1. Social Media (65%)
2. TV News (52%)
3. News Websites (41%)
1. Social Media (63%)
2. TV News (54%)
3. News Websites (39%)
1. Social Media (61%)
2. News Websites (48%)
3. Podcasts (44%)
1. TV News (58%)
2. Podcasts (47%)
3. Radio (41%)
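A minimal sketch of the Top-K calculation is shown below; the percentages for options outside each published top three are hypothetical fillers added so the example is self-contained.

```python
def top_k_accuracy(live_pcts, synthetic_pcts, k=3):
    """Fraction of the live top-k options that also appear in the synthetic top-k."""
    top_live = {opt for opt, _ in sorted(live_pcts.items(), key=lambda x: -x[1])[:k]}
    top_syn  = {opt for opt, _ in sorted(synthetic_pcts.items(), key=lambda x: -x[1])[:k]}
    return len(top_live & top_syn) / k

# Values outside the published top three are made up for illustration.
live      = {"Social Media": 65, "TV News": 52, "News Websites": 41, "Podcasts": 30, "Radio": 25}
synthetic = {"Social Media": 61, "News Websites": 48, "Podcasts": 44, "TV News": 35, "Radio": 20}

print(f"Top-3 accuracy = {top_k_accuracy(live, synthetic):.2f}")  # 0.67 (synthetic example B)
```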
BERTScore F1
Semantic similarity measure for text responses using contextual embeddings.
"What do you like most about your current smartphone?"
"The camera quality is amazing and the battery lasts all day"
"Great camera and excellent battery life"
"The photos look good and it doesn't need charging often"
"Camera works well, battery is decent"
"I like the design and how fast it works"
Performance Benchmarks by Question Type
Each question format requires different validation approaches. The benchmarks below represent the statistical thresholds our current synthetic data models consistently meet or exceed:
| Question Type | Metric(s) | Achieved Benchmark | Notes |
|---|---|---|---|
| Single-choice (Likert scales, categorical) | KL-Divergence, JS-Divergence | KL < 0.10; JS < 0.05 | We verify that response percentages match closely between live and synthetic data. For rating scales, we may group similar ratings together to improve accuracy. |
| Multi-choice (select all that apply) | JS-Divergence, Spearman, Top-K | JS < 0.05; Spearman > 0.75; Top-K > 0.8 | We check each option individually and verify that popular choices remain popular. Standards become stricter with more answer options available. |
| Numeric, binned (age ranges, income brackets) | KL-Divergence, JS-Divergence | KL < 0.10; JS < 0.05 | Age groups and income brackets are tested without the model seeing the actual live data ranges, ensuring unbiased validation. |
| Percent-allocation (budget allocation, time spent) | KL-Divergence, JS-Divergence, Top-K | KL < 0.10; JS < 0.05; Top-K > 0.8 | We verify both the percentage splits and that the most important items remain ranked highest in synthetic data. |
| Ranking (priority ordering, preferences) | Spearman, Top-K | Spearman > 0.75; Top-K > 0.8 | Focus on preserving the order of importance. Longer lists of items to rank require higher accuracy standards. |
| Text responses (open-ended questions) | BERTScore F1, Optimal Matching (OMS) | BERTScore F1 > 0.75; OMS > 0.75 | We measure both the meaning similarity of responses and whether response patterns (length, topics, sentiment) match the original survey. |
Note: These benchmarks represent current model performance validated against live survey data. All validation studies include confidence interval analysis and minimum sample size requirements (typically n≥300 for reliable divergence measurements).
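To make the benchmark table concrete, the sketch below shows one way the type-specific thresholds could be applied to a question's computed metrics; the dictionary keys and structure are illustrative, not our production configuration.

```python
# Thresholds mirroring the benchmark table above; structure and keys are illustrative.
BENCHMARKS = {
    "single_choice":      {"kl": 0.10, "js": 0.05},
    "multi_choice":       {"js": 0.05, "spearman": 0.75, "top_k": 0.80},
    "numeric_binned":     {"kl": 0.10, "js": 0.05},
    "percent_allocation": {"kl": 0.10, "js": 0.05, "top_k": 0.80},
    "ranking":            {"spearman": 0.75, "top_k": 0.80},
    "text":               {"bertscore_f1": 0.75, "oms": 0.75},
}

LOWER_IS_BETTER = {"kl", "js"}  # divergences must stay below the threshold; all others above

def passes_benchmarks(question_type: str, metrics: dict) -> dict:
    """Compare one question's computed metrics against its type-specific thresholds."""
    return {
        name: (metrics[name] < limit) if name in LOWER_IS_BETTER else (metrics[name] > limit)
        for name, limit in BENCHMARKS[question_type].items()
    }

print(passes_benchmarks("single_choice", {"kl": 0.005, "js": 0.001}))
# {'kl': True, 'js': True}
```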
How to Read Validation Data
When we publish validation results, they follow a standardized format to ensure transparency and reproducibility:
- Distribution Tables: Side-by-side percentage breakdowns showing live survey responses vs. synthetic responses
- Sample Sizes: Both live (n=X) and synthetic (n=Y) sample sizes clearly stated
- Metric Summary: Calculated divergence/correlation values with pass/fail against our standards
- Confidence Intervals: Where applicable, 95% confidence intervals for key metrics (a bootstrap sketch follows this list)
- Subgroup Analysis: Performance across demographic segments when sample sizes permit
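As an example of how an interval like this can be produced, the sketch below uses a percentile bootstrap over respondent-level answers for the JS divergence. This is one standard approach, shown as an illustration under simulated data rather than a description of our exact procedure.

```python
import numpy as np
from scipy.stats import entropy

rng = np.random.default_rng(seed=0)

def js_divergence(p_counts, q_counts):
    """Jensen-Shannon divergence between two answer-count vectors."""
    p = np.asarray(p_counts, float)
    q = np.asarray(q_counts, float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    return 0.5 * entropy(p, m) + 0.5 * entropy(q, m)

def bootstrap_js_ci(live_answers, synthetic_answers, n_options, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI for JS divergence between two samples of answer indices."""
    stats = []
    for _ in range(n_boot):
        live_rs = rng.choice(live_answers, size=len(live_answers), replace=True)
        syn_rs  = rng.choice(synthetic_answers, size=len(synthetic_answers), replace=True)
        stats.append(js_divergence(np.bincount(live_rs, minlength=n_options),
                                   np.bincount(syn_rs,  minlength=n_options)))
    return tuple(np.quantile(stats, [alpha / 2, 1 - alpha / 2]))

# Simulated example: 300 live and 300 synthetic answers to a 5-option question
live = rng.choice(5, size=300, p=[0.35, 0.25, 0.20, 0.15, 0.05])
syn  = rng.choice(5, size=300, p=[0.32, 0.28, 0.18, 0.16, 0.06])
low, high = bootstrap_js_ci(live, syn, n_options=5)
print(f"95% CI for JS divergence: [{low:.3f}, {high:.3f}]")
```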
Validation Schedule: We aim to publish new validation studies monthly, with ongoing studies rotating across different domains and question types. All raw validation data will be made available for independent verification.
Upcoming Validation Tables
For each completed study, we publish side-by-side aggregate distributions of live vs synthetic responses, along with metric summaries and statistical tests. Current studies in progress:
| Study | Design | Expected Completion |
|---|---|---|
| Consumer Brand Tracking Study | Live n=1,200 vs. synthetic n=1,200; single-choice brand preference questions | Q2 2024 |
| Healthcare Treatment Preferences | Live n=800 vs. synthetic n=800; multi-choice and ranking questions | Q2 2024 |
| Political Opinion Tracking | Live n=1,500 vs. synthetic n=1,500; mixed question types including text responses | Q3 2024 |
| B2B Technology Adoption | Live n=600 vs. synthetic n=600; percent-allocation and ranking questions | Q3 2024 |
Limitations & Transparency
We are committed to transparent reporting of synthetic data limitations and boundary conditions:
- Highly Emotional Topics: Reduced accuracy observed for traumatic or deeply personal experiences
- Cultural Coverage: Performance gaps in cultural contexts with limited training data representation
- Emerging Phenomena: Novel social trends or unprecedented events require model recalibration
- Complex Interactions: Multi-way demographic and psychographic interactions may be simplified
- Temporal Drift: Periodic recalibration needed (every 6-12 months) to track genuine population shifts
Transparency Commitment: We will publish validation failures alongside successes, document specific use cases where synthetic data underperforms, and provide clear guidance on appropriate applications. Understanding limitations is essential for responsible synthetic data use.
Validation Roadmap
Our systematic approach to expanding validation coverage:
- Phase 1 (Current): Core question types across consumer, healthcare, and political domains
- Phase 2 (Q3 2024): Complex survey logic, skip patterns, and multi-wave studies
- Phase 3 (Q4 2024): Cross-cultural validation and specialized professional populations
- Phase 4 (2025): Academic partnerships and peer-reviewed publication of methodology
Each validation study strengthens our understanding of synthetic data capabilities and helps establish industry standards for this emerging methodology.