Validation Framework & Benchmarks

Statistical validation methodology for comparing live vs synthetic survey data distributions.

Six completed validation reports are available for download, covering consumer, healthcare, and social research domains. Each report includes full distribution tables, metric summaries, and subgroup analyses. Browse all papers and reports →

Validation Framework

Our validation methodology uses a question type-specific approach. Each question type (single-choice, multi-choice, numeric, ranking, text) is evaluated using metrics designed for its data structure. This prevents the use of inappropriate statistical tests and ensures that passing thresholds reflect meaningful agreement.
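The type-specific approach can be pictured as a simple lookup from question type to metric set. A minimal sketch, matching the pairings in the benchmark table below; all names here are illustrative, not a production API:

```python
# Illustrative mapping of question types to the metrics used to evaluate them.
# The names are hypothetical; they mirror the benchmark table, not real code.
METRICS_BY_TYPE = {
    "single_choice": ["kl_divergence", "js_divergence"],
    "multi_choice": ["js_divergence", "spearman", "top_k_overlap"],
    "numeric_binned": ["kl_divergence", "js_divergence"],
    "percent_allocation": ["kl_divergence", "js_divergence", "top_k_overlap"],
    "ranking": ["spearman", "top_k_overlap"],
    "text": ["bertscore_f1", "optimal_matching"],
}

def metrics_for(question_type: str) -> list[str]:
    """Return the metrics appropriate for a question type, or fail loudly."""
    try:
        return METRICS_BY_TYPE[question_type]
    except KeyError:
        raise ValueError(f"Unsupported question type: {question_type}")
```

Dispatching on question type up front is what prevents, for example, a rank correlation from ever being applied to single-choice percentages.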

Every validation study follows a paired-comparison design: one live panel dataset and one synthetic dataset generated from the same survey instrument under identical conditions. The synthetic data is generated without access to live results.

Current Performance: 80-90% of tested questions across studies meet or exceed published benchmarks. Performance varies by question type, domain complexity, and topic sensitivity. All results are reported, including failures.

QA & Outlier Detection

Before any comparison metrics are computed, both live and synthetic datasets undergo quality assurance screening. Outlier detection, straightlining checks, and response consistency validation ensure that comparisons reflect data quality, not noise.

Our QA pipeline applies the same checks to both live and synthetic data, so neither dataset receives preferential treatment. For full details on our quality assurance methodology, see the Quality Assurance documentation.
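One of the checks mentioned above, straightlining, can be sketched as a simple per-respondent rule. This is an illustrative stand-in, not our production QA code; the 90% cutoff is an assumed parameter:

```python
def is_straightliner(responses: list[int], max_identical_frac: float = 0.9) -> bool:
    """Flag a respondent who gives (nearly) the same answer to every grid item.

    `responses` holds one numeric answer per grid row. A respondent whose most
    common answer covers more than `max_identical_frac` of rows is flagged.
    The threshold here is a hypothetical default, not a documented value.
    """
    if not responses:
        return False
    most_common = max(responses.count(v) for v in set(responses))
    return most_common / len(responses) > max_identical_frac
```

Because the same rule runs on live and synthetic respondents alike, a flagged row is excluded from both sides before any divergence metric is computed.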

Understanding Our Metrics with Examples

The following examples illustrate how each metric works in practice, using simplified live vs synthetic comparisons.

KL-Divergence

KL-Divergence measures how one probability distribution diverges from another. A value of 0 means the distributions are identical. Lower values indicate better alignment.

Live distribution
Preferred grocery factor

Price: 42% | Proximity: 28% | Quality: 18% | Selection: 12%

Synthetic distribution
Preferred grocery factor

Price: 40% | Proximity: 30% | Quality: 17% | Selection: 13%

In this example, KL-Divergence = 0.003, well below the 0.10 threshold. The synthetic distribution closely mirrors the live data with only 1-2 percentage point differences per option.
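The figure above can be reproduced directly from the two distributions. A minimal sketch, assuming base-2 logarithms (which yields the reported 0.003):

```python
import math

def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(P || Q) in bits (base-2 log), over aligned probability lists."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Price, Proximity, Quality, Selection
live = [0.42, 0.28, 0.18, 0.12]
synthetic = [0.40, 0.30, 0.17, 0.13]

print(round(kl_divergence(live, synthetic), 3))  # → 0.003
```

Note that KL is asymmetric: swapping the arguments gives a slightly different value, so the direction of comparison must be fixed in advance.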

Spearman Rank Correlation

Spearman correlation measures whether the rank ordering of options is preserved between live and synthetic data. A value of 1.0 means perfect rank agreement.

Live ranking
Treatment preference order

1. Drug A | 2. Drug C | 3. Drug B | 4. Drug D

Synthetic ranking
Treatment preference order

1. Drug A | 2. Drug C | 3. Drug D | 4. Drug B

Spearman = 0.80. The top two positions match exactly. Positions 3 and 4 are swapped, but the overall ordering is strongly correlated and exceeds the 0.75 threshold.
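The 0.80 figure follows from the classic d-squared formula. A minimal sketch, assuming no tied ranks:

```python
def spearman(rank_a: dict, rank_b: dict) -> float:
    """Spearman rank correlation via 1 - 6*sum(d^2)/(n(n^2-1)); assumes no ties."""
    n = len(rank_a)
    d2 = sum((rank_a[item] - rank_b[item]) ** 2 for item in rank_a)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

live = {"Drug A": 1, "Drug C": 2, "Drug B": 3, "Drug D": 4}
synthetic = {"Drug A": 1, "Drug C": 2, "Drug D": 3, "Drug B": 4}

print(spearman(live, synthetic))  # → 0.8
```

The two swapped items each contribute d² = 1, so the sum is 2 and the correlation is 1 − 12/60 = 0.80.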

Top-K Overlap

Top-K measures the proportion of top-ranked items that appear in both the live and synthetic top-K sets. It focuses on whether the model gets the most important options right.

Live top-3
Most-selected features

Battery life | Camera quality | Price

Synthetic top-3
Most-selected features

Battery life | Price | Camera quality

Top-3 overlap = 1.0 (3/3 items match). The same three features appear in both top-3 sets, even though the internal ordering differs slightly. This exceeds the 0.8 threshold.
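Because Top-K compares sets rather than orderings, the computation ignores position within the top K. A minimal sketch:

```python
def top_k_overlap(live_top: list[str], synth_top: list[str]) -> float:
    """Fraction of the live top-K items that also appear in the synthetic top-K."""
    return len(set(live_top) & set(synth_top)) / len(live_top)

live = ["Battery life", "Camera quality", "Price"]
synthetic = ["Battery life", "Price", "Camera quality"]

print(top_k_overlap(live, synthetic))  # → 1.0
```

This is why the example scores 1.0 despite the order differing: set membership, not rank, is what the metric rewards.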

BERTScore

BERTScore measures semantic similarity between text responses using contextual embeddings. Unlike exact-match comparison, it captures whether responses convey the same meaning even when worded differently.

Live response
Why did you choose this brand?

"I've been using it for years and it works reliably every time."

Synthetic response
Why did you choose this brand?

"It's a brand I trust because of its consistent performance over time."

BERTScore F1 = 0.82. The responses express the same core themes (long-term use, reliability, trust) using different wording. This exceeds the 0.75 threshold.
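Real BERTScore requires contextual embeddings from a pretrained model, so the sketch below shows only the core greedy-matching F1 step, run on made-up 2-d token vectors standing in for BERT embeddings. It illustrates the mechanics, not the 0.82 figure above:

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def bertscore_f1(cand_vecs: list, ref_vecs: list) -> float:
    """Greedy-matching F1 over token similarities (BERTScore's core step).

    Recall: each reference token matched to its most similar candidate token.
    Precision: each candidate token matched to its most similar reference token.
    """
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    return 2 * precision * recall / (precision + recall)

# Toy stand-ins for contextual embeddings of two short responses.
ref = [[1.0, 0.0], [0.7, 0.7]]
cand = [[0.9, 0.1], [0.6, 0.8]]
score = bertscore_f1(cand, ref)
```

In practice the token vectors come from a transformer model, which is what lets differently worded but semantically equivalent responses score highly.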

Performance Benchmarks

The table below summarizes the metrics, thresholds, achieved benchmarks, and practical notes for each question type.

| Question Type | Metrics | Threshold | Achieved Benchmark | Notes |
| --- | --- | --- | --- | --- |
| Single-choice | KL-Divergence, JS-Divergence | KL < 0.10, JS < 0.05 | KL ~0.03-0.08 | Likert collapsing permitted; strongest performance category |
| Multi-choice | JS-Divergence, Spearman, Top-K | JS < 0.05, Spearman > 0.75, Top-K > 0.8 | JS ~0.02-0.04, Spearman ~0.80-0.92 | Per-option binaries; performance scales with number of options |
| Numeric (binned) | KL-Divergence, JS-Divergence | KL < 0.10, JS < 0.05 | KL ~0.04-0.09 | Bins pre-declared; accuracy depends on bin granularity |
| Percent-allocation | KL-Divergence, JS-Divergence, Top-K | KL < 0.10, JS < 0.05, Top-K > 0.8 | KL ~0.05-0.09, Top-K ~0.85 | Dominant allocations tracked; more options increase difficulty |
| Ranking | Spearman, Top-K | Spearman > 0.75, Top-K > 0.8 | Spearman ~0.78-0.90 | Metrics scale with list length; top positions most accurate |
| Text responses | BERTScore F1, Optimal Matching | F1 > 0.75, OMS > 0.75 | F1 ~0.78-0.85 | Semantic similarity; factual questions outperform emotional ones |
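JS-Divergence, which appears alongside KL throughout the table, is its symmetric, bounded counterpart: each distribution is compared against their average. A minimal sketch, again assuming base-2 logarithms and reusing the grocery-question figures:

```python
import math

def kl(p: list[float], q: list[float]) -> float:
    """KL(P || Q) in bits over aligned probability lists."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence (base-2): symmetric and bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # midpoint distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

live = [0.42, 0.28, 0.18, 0.12]
synthetic = [0.40, 0.30, 0.17, 0.13]
print(js_divergence(live, synthetic))
```

Symmetry is why JS suits side-by-side reporting: neither dataset has to be designated the reference, and the value is unchanged if the arguments are swapped.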

How to Read Our Reports

Each validation report follows a standardized structure. Here is what to look for in each section.

  • Distribution Tables: Side-by-side percentage distributions for live and synthetic data on every question. Look for percentage point differences and the direction of any bias.
  • Sample Sizes: Live (N) and synthetic (N) reported per question and per subgroup. Subgroup analyses with N < 50 are flagged as directional only.
  • Metric Summary: A pass/fail table showing each metric, its computed value, the threshold, and the result. Failed metrics are highlighted with likely cause analysis.
  • Confidence Intervals: 95% bootstrap confidence intervals on all key metrics. Overlapping CIs between live and synthetic distributions indicate non-significant differences.
  • Subgroup Analysis: Key demographic subgroups (age, gender, income, region) analyzed separately. Subgroup performance may vary from overall results, and these differences are documented.
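The bootstrap confidence intervals mentioned above can be sketched for the simplest case, the difference in the share selecting one option. This is an illustrative stand-in for the reports' methodology, with resample count and seed chosen arbitrarily:

```python
import random

def bootstrap_ci(live: list[int], synth: list[int],
                 n_boot: int = 2000, seed: int = 0) -> tuple:
    """95% percentile bootstrap CI on the difference in proportions.

    `live` and `synth` are 0/1 indicators per respondent for one answer
    option. Resamples each dataset with replacement n_boot times and takes
    the 2.5th and 97.5th percentiles of the synthetic-minus-live difference.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        l = [rng.choice(live) for _ in live]
        s = [rng.choice(synth) for _ in synth]
        diffs.append(sum(s) / len(s) - sum(l) / len(l))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]

# 42% vs 40% choosing "Price", N = 100 each (hypothetical figures).
lo, hi = bootstrap_ci([1] * 42 + [0] * 58, [1] * 40 + [0] * 60)
```

With samples this small the interval comfortably spans zero, matching the report convention that overlapping live/synthetic intervals are read as non-significant differences.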

Limitations & Transparency

We document both strengths and limitations in every validation study. Understanding where synthetic data works well and where it falls short is essential for responsible application.

  • Topic sensitivity: Questions involving trauma, stigma, or strong social desirability bias show reduced accuracy.
  • Rare populations: Very small demographic segments (<2% of population) may not be well-represented in training data.
  • Temporal context: Models reflect their training data period. Rapid attitude shifts may not be captured until recalibration.
  • Geographic scope: Current validation is primarily U.S.-based. International applications require separate validation.
  • Interaction effects: Complex multi-way demographic interactions may be attenuated in synthetic data.

Transparency commitment: We publish all validation results, including failures. Every report includes a limitations section with specific guidance on where synthetic data should and should not be used for that study's domain and topic area.

Run your own validation.

Generate synthetic data from your survey and compare against your live benchmarks.