Experimentation and A/B Testing

Every change to an AI agent carries risk. A new system prompt, a different LLM, an updated ASR model, or a revised tool schema can each improve some conversations while degrading others. Without controlled experimentation, shipping a change is a bet. Okareo provides the statistical framework to turn that bet into an evidence-based decision.

Why Experiment on AI Agents

Product teams have long used A/B testing to measure the impact of UI changes on user behavior. AI agents need the same rigor, but the variables are different: prompts, models, voice providers, tool definitions, and conversation logic. The question is the same. Did this change actually make things better, or is the movement just noise?

Okareo's Compare Evaluations answers that question with both frequentist and Bayesian statistical testing, purpose-built for AI agent evaluation. Run the same scenarios against two configurations, and get a rigorous statistical comparison across every check.

What You Can Test

Any change to your agent stack is a candidate for experimentation:

Variable	Example
LLM	Swap GPT-4o for Claude Sonnet and measure task completion, latency, and compliance
System prompt	Rewrite agent instructions and compare response quality and guardrail adherence
ASR provider or model	Switch speech recognition and measure whether downstream task accuracy holds
TTS voice or provider	Change the voice and evaluate conversation flow and caller experience metrics
Tool definitions	Update function schemas and verify the agent still calls the right tools at the right time
Augmentation conditions	Add background noise or barge-in to test whether a change holds up under stress
Driver persona	Introduce adversarial or edge-case callers to probe robustness

How It Works

1. Run two evaluations against the same scenario

Change only the variable you are testing. Keep the Driver, Scenario, Checks, and repeats identical so the comparison is fair.

2. Open Compare Evaluations

In the app: Click Compare on any finished run's detail page (or Compare Evaluations on a scenario card). Two cards appear side by side: click each to select the Control evaluation (your baseline) and the Variant (the candidate). The variant list is automatically scoped to finished runs of the same type on the same scenario, so comparisons stay apples-to-apples.

From the SDK: Run two simulations programmatically, then open either in the app and click Compare.

3. Review the statistical results

Okareo computes paired statistical tests across every check that both evaluations have in common:

Column	What it tells you
Samples	Number of matched scenario rows
Control	Check score (pass rate or mean) for the baseline evaluation
Variant	Check score for the candidate evaluation
Raw p-value	McNemar exact (binary checks) or paired t-test (score checks)
Adjusted p-value (BH)	Benjamini-Hochberg correction for multiple comparisons
Chance to Beat	Bayesian posterior probability that the variant outperforms the control
Credible Interval	95% credible interval for the difference between variant and control
Risk to Ship	Expected loss if you ship the variant and it is actually worse
Risk to Keep	Expected loss if you keep the control and the variant is actually better

A check icon marks statistically significant differences (at alpha = 0.05). An approximately-equal icon marks differences that are not significant. Bayesian metrics (Chance to Beat, Credible Interval, Risk to Ship/Keep) require at least 5 scenario rows with differing results.
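For intuition about the two p-value columns, here is a minimal sketch of an exact McNemar test on paired pass/fail results, followed by a Benjamini-Hochberg adjustment across several checks. It uses only the Python standard library. The function names `mcnemar_exact` and `benjamini_hochberg` are illustrative, not part of the Okareo SDK, and Okareo's own implementation may differ in detail. The paired t-test used for score checks is omitted here.

```python
from math import comb


def mcnemar_exact(control, variant):
    """Two-sided exact McNemar p-value for paired pass/fail results.

    `control` and `variant` are equal-length lists of booleans, one entry
    per matched scenario row (hypothetical input shape for this sketch).
    """
    # Only discordant rows (where the two runs disagree) carry information.
    b = sum(1 for c, v in zip(control, variant) if c and not v)  # control pass, variant fail
    c = sum(1 for c_, v in zip(control, variant) if not c_ and v)  # control fail, variant pass
    n = b + c
    if n == 0:
        return 1.0  # the runs never disagree, so there is no evidence of a difference
    # Under the null hypothesis, each discordant row is a fair coin flip.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Example: the variant fixes 8 rows that the control failed and breaks none.
control = [False] * 8 + [True] * 2
variant = [True] * 10
raw_p = mcnemar_exact(control, variant)
adjusted = benjamini_hochberg([raw_p, 0.04, 0.03, 0.20])
```

Because the adjustment depends on how many checks are compared at once, the same raw p-value can be significant in a comparison with two checks and not significant in one with twenty.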

Figure: The Compare Evaluations page, showing the Control and Variant cards, the Statistical Test Results table, and per-scenario rows.

Reading the Results

Focus on three columns for a ship/no-ship decision:

Adjusted p-value (BH). Is the difference statistically significant after correcting for the number of checks being compared?
Chance to Beat. What is the probability that the variant is genuinely better than the control?
Risk to Ship vs. Risk to Keep. What is the expected downside of each decision?

If Chance to Beat is high (above 80%) and Risk to Ship is low, you have strong evidence to ship. If Risk to Keep is high, staying on the current configuration is costing you real quality.

These Bayesian metrics are designed to be interpretable without a statistics background. Chance to Beat is a direct probability ("there is a 92% chance the new prompt is better"). Risk to Ship and Risk to Keep quantify the stakes in the units of the check itself.
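To make these quantities concrete, the sketch below estimates them for a binary check by Monte Carlo sampling. It assumes independent Beta(1, 1) priors on each pass rate and ignores the pairing between scenario rows. That is a simplification chosen for illustration, since this page does not describe the model Okareo uses, and `bayesian_summary` is a hypothetical helper rather than an SDK function.

```python
import random


def bayesian_summary(control_passes, variant_passes, n, draws=20000, seed=0):
    """Approximate Bayesian comparison of two pass rates over n rows each.

    Assumption for this sketch: independent Beta(1, 1) priors, unpaired data.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(draws):
        pc = rng.betavariate(1 + control_passes, 1 + n - control_passes)
        pv = rng.betavariate(1 + variant_passes, 1 + n - variant_passes)
        diffs.append(pv - pc)  # positive means the variant is better
    diffs.sort()
    return {
        # Share of posterior draws in which the variant beats the control.
        "chance_to_beat": sum(d > 0 for d in diffs) / draws,
        # Central 95% interval of the variant-minus-control difference.
        "credible_interval": (diffs[int(0.025 * draws)], diffs[int(0.975 * draws) - 1]),
        # Expected pass rate lost if you ship and the variant is actually worse.
        "risk_to_ship": sum(max(-d, 0.0) for d in diffs) / draws,
        # Expected pass rate forgone if you keep the control and the variant is better.
        "risk_to_keep": sum(max(d, 0.0) for d in diffs) / draws,
        "mean_diff": sum(diffs) / draws,
    }


# Example: control passes 12 of 20 rows, variant passes 18 of 20.
summary = bayesian_summary(control_passes=12, variant_passes=18, n=20)
```

In this framing the two risks are expressed in pass-rate units, and their difference equals the expected improvement from shipping the variant.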

Designing Experiments

For a comparison to be meaningful, hold everything constant except the variable you are testing.

Variable being tested	Hold constant	Vary
Agent prompt change	Driver, Scenario, Checks, `repeats`	Target (or its underlying agent config)
Model swap	Driver, Scenario, Checks, `repeats`	Target (pointing at the new model)
Augmentation impact	Driver, Target, Scenario, Checks, `repeats`	`augmentation` parameter on `run_simulation`
Driver behavior change	Target, Scenario, Checks, `repeats`	Driver
Statistical power	Driver, Target, Scenario, Checks	`repeats` (raise to 5 or 10 for noisy checks)

Total simulations = scenario rows x repeats. See Running Simulations for the full formula.
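One way to enforce the hold-constant rule is to diff the two run configurations before launching them. The sketch below is an assumption-laden illustration: `RunConfig`, its fields, and the example values are hypothetical stand-ins for however you track your own settings, and none of them are Okareo SDK types.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical record of the settings behind one evaluation run."""
    driver: str
    target: str
    scenario: str
    checks: tuple
    repeats: int
    augmentation: str = "none"


def changed_fields(control, variant):
    """List the settings that differ between the two runs."""
    return [
        f.name
        for f in fields(RunConfig)
        if getattr(control, f.name) != getattr(variant, f.name)
    ]


def total_simulations(scenario_rows, repeats):
    """Total simulations = scenario rows x repeats."""
    return scenario_rows * repeats


control = RunConfig("default-caller", "agent-v1", "billing-flows", ("task_completed",), 3)
variant = RunConfig("default-caller", "agent-v2", "billing-flows", ("task_completed",), 3)

diff = changed_fields(control, variant)
if len(diff) != 1:
    raise ValueError(f"Expected exactly one changed variable, found: {diff}")
```

A guard like this catches the common mistake of changing the prompt and `repeats` in the same experiment, which makes the comparison hard to attribute.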

Sample Size and Statistical Power

A single conversation is an anecdote. Comparing two evaluations at repeats=1 will show movement that is often just noise. Raise repeats until the between-evaluation signal is larger than the within-evaluation variance.

Goal	Suggested setup
Quick sanity check	2 to 4 scenarios, `repeats=1` (2 to 4 calls). Catches catastrophic regressions only.
Pre-deploy validation	5 to 10 scenarios, `repeats=2` (about 10 to 20 calls). Reasonable signal for binary checks.
Nightly regression suite	8 to 15 scenarios, `repeats=3` (about 24 to 45 calls). Good signal for noisy LLM checks like `response_consistency`.
High-confidence launch decision	15+ scenarios, `repeats=5` (about 75+ calls). For changes you cannot easily roll back.
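As a rough guide to why larger setups give cleaner signal, the sketch below computes the standard error of the difference between two observed pass rates for the upper-end call counts in the table. It assumes a true pass rate of 0.8 in both runs and treats every call as an independent sample. Both are simplifying assumptions, and the paired tests described earlier will often be more sensitive than this unpaired estimate.

```python
from math import sqrt


def pass_rate_diff_stderr(pass_rate, calls_per_run):
    """Standard error of the difference between two independent pass rates."""
    return sqrt(2 * pass_rate * (1 - pass_rate) / calls_per_run)


# Upper-end call counts per run, taken from the table above.
setups = {
    "Quick sanity check": 4,
    "Pre-deploy validation": 20,
    "Nightly regression suite": 45,
    "High-confidence launch decision": 75,
}

for name, calls in setups.items():
    se = pass_rate_diff_stderr(0.8, calls)
    # Differences smaller than about two standard errors are hard to tell from noise.
    print(f"{name}: {calls} calls per run, noise band about +/- {2 * se:.2f}")
```

Under these assumptions, a quick sanity check can only detect very large swings in pass rate, while the launch-decision setup can resolve much smaller differences.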

Re-scoring (instead of re-running) is the right tool when you only want to compare check definitions on the same conversations. See Re-Scoring Past Runs.

Where to Go Next

CI Gating: automate threshold-based quality gates in your CI pipeline.
Load Testing: stress-test under concurrency after validating quality.
Re-Scoring Past Runs: compare check definitions without re-running conversations.
Scheduling Simulations: wire experiments into nightly or per-push automation.
