Research foundation

Methodology & Validation Protocol

How we generate synthetic respondents, prevent leakage, and validate results against live benchmarks.

Study Design

Every validation study follows a paired-comparison protocol: one live panel dataset and one synthetic dataset generated from the same survey instrument, with identical quotas and demographic targets. This ensures that differences in output reflect model performance, not design artifacts.

Paired Comparisons

Live and synthetic datasets are collected from the same questionnaire under the same conditions. The live dataset serves as the benchmark. Synthetic respondents are generated without access to the live results, ensuring clean separation between training data and test data.

Unit of Analysis

The primary unit of analysis is the question-level distribution, not the individual respondent. We compare the distribution of responses across all answer options for each question, using statistical measures appropriate to the question type.
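The aggregation from individual respondents to the question-level unit of analysis can be sketched in a few lines. `question_distribution` is a hypothetical helper for illustration, not part of any published pipeline:

```python
from collections import Counter

def question_distribution(answers):
    """Collapse individual responses into the question-level
    distribution that serves as the unit of analysis."""
    counts = Counter(answers)
    n = len(answers)
    return {option: c / n for option, c in counts.items()}

# Four respondents become one categorical distribution per question.
dist = question_distribution(["a", "a", "b", "c"])
```

Live and synthetic datasets each yield one such distribution per question; all downstream metrics compare these distributions, never individual synthetic "respondents" against individual live ones.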

Model Freeze & Provenance

Before any validation study begins, the model version is frozen and recorded. All synthetic data is generated from the locked model, and the model version, prompt templates, and generation parameters are logged for full reproducibility.
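A provenance record of this kind might look like the following sketch. The field names and values are illustrative assumptions, not Simsurveys' actual schema; hashing the prompt template makes any later edit detectable:

```python
import hashlib
import json

# Hypothetical provenance manifest, frozen before generation begins.
# All field names and values here are illustrative examples.
manifest = {
    "model_version": "2025.06-frozen",
    "prompt_template_sha256": hashlib.sha256(b"<template text>").hexdigest(),
    "generation_params": {"temperature": 0.7, "top_p": 0.95, "seed": 1234},
    "frozen_at": "2025-06-01T00:00:00Z",
}

# Serialized and stored alongside the study so results are reproducible.
record = json.dumps(manifest, indent=2)
```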

Performance snapshot: Current models meet published benchmarks on ~80-90% of questions across completed validation studies. Performance varies by question type, domain, and topic sensitivity. We report all results, including failures.

Encoding Rules & Leakage Controls

Each question type has specific encoding and comparison rules designed to prevent information leakage and ensure fair, reproducible measurement.

  • Single-choice / Likert: Responses are compared as categorical distributions. For Likert scales, collapsing (e.g., top-2 box) is permitted and pre-declared before analysis.
  • Multi-choice: Each option is treated as an independent binary variable. Distributions are compared per-option to avoid inflating agreement scores.
  • Numeric: Continuous values are binned into pre-declared ranges before comparison. Bin boundaries are set before the synthetic data is generated.
  • Percent-allocation: Allocation shares are compared as distributions across options. Dominant allocation patterns (e.g., one option receiving >50%) are tracked separately.
  • Ranking: Rank-order data is compared using correlation and top-K overlap metrics. Metrics scale with list length to account for difficulty.
  • Text responses: Open-ended responses are compared using semantic similarity (BERTScore) and optimal matching scores. No exact-match comparison is used.
  • Weighting parity: Both live and synthetic datasets use the same weighting scheme. If the live data is unweighted, synthetic data is also compared unweighted.
  • Separation of roles: The team that designs the survey instrument does not have access to the synthetic generation pipeline. The team that generates synthetic data does not see live results until after generation is complete.
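Two of the encoding rules above can be sketched concretely: top-2 box collapsing for Likert items and per-option binary encoding for multi-choice items. Both functions are hypothetical helpers; the 5-point scale is an assumption, and any collapsing rule must be declared before generation:

```python
from collections import Counter

def top2_box(responses, scale_max=5):
    """Collapse Likert responses into a top-2-box share.
    scale_max=5 assumes a 5-point scale; the collapsing rule
    must be pre-declared before any synthetic data is generated."""
    return sum(1 for r in responses if r >= scale_max - 1) / len(responses)

def per_option_binaries(selections, options):
    """Encode multi-choice answers as one binary share per option,
    so each option is compared independently rather than as one
    joint distribution (which would inflate agreement scores)."""
    counts = Counter(opt for sel in selections for opt in sel)
    return {opt: counts[opt] / len(selections) for opt in options}
```

For example, `top2_box([5, 4, 3, 2, 1])` returns 0.4 (two of five responses in the top two boxes), and each option's binary share is later compared live-vs-synthetic on its own.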

Metrics & Pass/Fail Standards

Each question type is evaluated using metrics appropriate to its data structure. The table below defines the primary metrics, passing thresholds, and relevant notes for each type.

Question type       | Metrics                              | Standards                              | Notes
Single-choice       | KL-divergence, JS-divergence         | KL < 0.10, JS < 0.05                   | Likert collapsing permitted
Multi-choice        | JS-divergence, Spearman, Top-K       | JS < 0.05, Spearman > 0.75, Top-K > 0.8 | Per-option binaries
Numeric (binned)    | KL-divergence, JS-divergence         | KL < 0.10, JS < 0.05                   | Bins pre-declared
Percent-allocation  | KL-divergence, JS-divergence, Top-K  | KL < 0.10, JS < 0.05, Top-K > 0.8      | Dominant allocations tracked
Ranking             | Spearman, Top-K                      | Spearman > 0.75, Top-K > 0.8           | Thresholds scale with list length
Text responses      | BERTScore F1, Optimal Matching       | F1 > 0.75, OMS > 0.75                  | Semantic similarity only
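The divergence-based pass/fail checks in the table can be sketched directly. This is a minimal implementation over aligned categorical distributions; the `eps` smoothing constant is an assumption, and the production smoothing rule may differ:

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(P || Q) in nats over aligned categorical distributions.
    eps smoothing (an assumption) guards against zero-probability bins."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def passes_single_choice(live, synth, kl_max=0.10, js_max=0.05):
    """Apply the single-choice thresholds from the table above."""
    return kl_divergence(live, synth) < kl_max and js_divergence(live, synth) < js_max
```

A synthetic distribution of [0.9, 0.1] against a live [0.5, 0.5] fails both thresholds, while an identical pair trivially passes; in practice the thresholds sit between these extremes.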

Reporting Conventions

All validation reports follow a standardized structure to ensure comparability across studies and domains.

  • Full distribution tables: Live and synthetic distributions side-by-side for every question, with percentage point differences.
  • Sample sizes: Both live (N) and synthetic (N) sample sizes reported per question and per subgroup.
  • Metric summary: Pass/fail status for each metric on each question, with the computed value and the threshold.
  • Confidence intervals: 95% bootstrap confidence intervals on all key metrics.
  • Subgroup breakdowns: Key demographic subgroups analyzed separately (age, gender, income) where sample sizes permit.
  • Failure documentation: Questions that fail any metric are flagged, with analysis of likely causes and recommended mitigations.
  • Model provenance: Model version, generation date, and configuration parameters recorded in every report.
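The bootstrap confidence intervals listed above can be sketched with a percentile bootstrap. The resample count, percentile method, and fixed seed are illustrative assumptions, not the reported procedure:

```python
import random

def bootstrap_ci(values, stat=lambda xs: sum(xs) / len(xs),
                 n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap (1 - alpha) CI for a per-question metric.
    n_boot=2000, the percentile method, and seed=0 are illustrative
    assumptions; defaults give a 95% interval on the mean."""
    rng = random.Random(seed)
    stats = sorted(
        stat([rng.choice(values) for _ in values])  # resample with replacement
        for _ in range(n_boot)
    )
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

For a balanced 0/1 sample of 100 values, the 95% interval brackets the true mean of 0.5; a degenerate sample collapses to a zero-width interval, which is itself a useful diagnostic in a report.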

Domain-Specific Models

Simsurveys operates three specialized AI models, each trained on validated population data and fine-tuned for its research domain.

Healthcare & HCP

Physician prescribing patterns, patient experiences, treatment satisfaction. Augmented with U.S. physician-level prescription data across 15 medical specialties.

Consumer & Market

Product preferences, brand perception, purchase behavior. Validated against tier-one consumer panels with demographic and psychographic targeting.

Social & Political

Public opinion, policy preferences, voting behavior. Validated against major national polls with geographic and demographic structure.

Transparency & Limitations

We believe that transparent reporting of both strengths and weaknesses is essential for responsible use of synthetic survey data.

Known Limitations

  • Highly sensitive topics: Questions involving trauma, deeply personal experiences, or strong social desirability bias show reduced accuracy (10-20% lower alignment).
  • Emerging phenomena: Novel social trends, unprecedented events, or rapidly shifting attitudes require model updates before synthetic data can reflect them.
  • Cultural specificity: Models are primarily trained on U.S. population data. Non-U.S. contexts require separate validation and may show reduced accuracy.
  • Complex interactions: Multi-way demographic interactions (e.g., age × income × region × education) may be oversimplified in synthetic outputs.
  • Temporal drift: Models require periodic recalibration (every 6-12 months) to account for genuine population shifts.

Disclosure

All Simsurveys outputs are clearly labeled as synthetic data. We recommend that clients disclose the use of synthetic data in any published research and include appropriate caveats about the methodology.

Intellectual Property

Our methodology is protected by U.S. Patent Application No. 18/784,418 and additional provisional patents. This protects the integrity and uniqueness of the approach while allowing full transparency in how results are validated.

Research Documentation

Explore the full body of research supporting our methodology and validation approach.

Validation Framework

Statistical validation methodology, completed benchmark reports, and performance metrics across domains.

View framework →
Mode Effects Research

Historical analysis of survey mode effects and how synthetic data compares to traditional collection methods.

Read research →
AI & Human Prediction

Cognitive psychology research on how AI models predict human survey responses and where they diverge.

Explore findings →

See the methodology in action.

Generate your first synthetic dataset and compare against your own benchmarks.