AI Agent & Integration Testing
Voice agents sit at the front of your business. When one misfires (wrong policy, missed backend lookup, hallucinated refund status) the customer's only recourse is to hang up and call again, or churn. And because an agent that works perfectly over text may fail the same task over voice, the only way to know it works is to test it over voice.
This page covers what to test, how to structure scenarios for each category, and which checks to pair with each concern. All scenarios and checks described here are configured through the same simulation form in the Okareo app or programmatically via the SDK.
What to Validate
Task Completion
The most fundamental question: does the agent actually solve the caller's problem end-to-end?
Design scenario rows where the input describes a clear objective and the result describes the expected outcome. The result_completed check evaluates whether the conversation achieved that outcome: not just whether the agent said something plausible, but whether the task was actually done.
A warranty lookup that ends with "I can help you with that" but never confirms coverage status is a task-completion failure that only surfaces across multiple turns.
The transcript tells you the agent said the task was done. With agent trace capture enabled, the captured traces prove the side effects actually happened — the lookup was made, the update was written.
Compliance and Policy
Does the agent follow your rules? That means staying in scope, avoiding unauthorized commitments, handling PII correctly, and not making promises your business can't keep.
Use response_consistency to catch contradictions and drift across turns. When your scenario's expected result spells out behavioral directives (scope rules, required language), add behavior_adherence to score whether the agent followed them — note it evaluates the agent's final response, not every turn. For domain-specific policy that must hold across the whole conversation (refund limits, disclosure requirements, regulated language), write custom checks that encode your rules directly. See Custom Checks for how to create and version them.
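As a rough illustration of a rule that must hold across the whole conversation, the sketch below scans every agent turn for refund amounts above a limit. It is plain Python showing only the rule logic. The limit value, the function names, and the list-of-turns input shape are all assumptions made for this example, and how such logic is packaged and registered as an Okareo check is covered in Custom Checks, not here.

```python
import re

# Hypothetical policy value for illustration; substitute your own limit.
REFUND_LIMIT_USD = 200.00

# Dollar amounts such as "$129", "$129.00", or "$1,250.50".
_AMOUNT = re.compile(r"\$\s?(\d[\d,]*(?:\.\d{1,2})?)")


def dollar_amounts(text: str) -> list[float]:
    """Extract every dollar amount mentioned in a piece of text."""
    return [float(raw.replace(",", "")) for raw in _AMOUNT.findall(text)]


def refund_limit_violations(agent_turns: list[str],
                            limit: float = REFUND_LIMIT_USD) -> list[tuple[int, float]]:
    """Return (turn_index, amount) for each refund mention above the limit.

    Every agent turn is scanned, so a violation mid-conversation is caught
    even when the final response is compliant.
    """
    violations = []
    for index, turn in enumerate(agent_turns):
        if "refund" not in turn.lower():
            continue  # amounts in turns that never mention a refund are ignored
        for amount in dollar_amounts(turn):
            if amount > limit:
                violations.append((index, amount))
    return violations


turns = [
    "I can refund $129.00 to your card.",
    "Your original order total was $450.00.",
    "Sure, I'll refund the full $450.00 today.",
]
print(refund_limit_violations(turns))  # [(2, 450.0)]
```

Returning the offending turn index, not just a pass/fail flag, makes it easier to find the exact point where the agent went off-policy. Keyword matching on "refund" is deliberately crude; a production rule would likely need something more robust.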
Accuracy and Knowledge Base Grounding
Does the agent answer from your data, or hallucinate? This is especially critical over voice because ASR errors can feed garbled input to a retrieval step. A misheard product name returns the wrong KB article, and the agent confidently delivers wrong information.
Test this by including specific, verifiable facts in your scenario's expected result. If the agent should quote a price, a policy term, or a coverage date, put the correct value in result so checks can verify it.
Agent trace capture catches this failure at its source: the captured traces show the retrieval step's actual inputs and outputs, so you can see the garbled query and the wrong KB article it returned — not just the wrong answer the caller heard.
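To make the "put the correct value in result" idea concrete, here is a minimal sketch that pulls the verifiable tokens out of an expected result and reports which ones are missing from the agent's response. The token patterns (dollar amounts, four-digit years, ID-like strings) and the function names are assumptions for illustration, not a definition of how Okareo's checks score a result.

```python
import re

# Tokens treated as "verifiable" in this sketch: dollar amounts, 4-digit
# years, and ID-like strings such as INV-4410. Illustrative, not exhaustive.
_FACT = re.compile(
    r"\$\d[\d,]*(?:\.\d{1,2})?"   # $129.00
    r"|\b(?:19|20)\d{2}\b"        # 2027
    r"|\b[A-Z]{2,}-?\d+\b"        # INV-4410
)


def verifiable_facts(expected_result: str) -> set[str]:
    """Collect the concrete values an expected result commits to."""
    return set(_FACT.findall(expected_result))


def missing_facts(expected_result: str, agent_response: str) -> set[str]:
    """Return the expected values that never appear in the agent's response."""
    return {fact for fact in verifiable_facts(expected_result)
            if fact not in agent_response}


expected = "Agent confirms drivetrain warranty is active until 2027"
response = "Good news, your drivetrain warranty is active until 2026."
print(missing_facts(expected, response))  # {'2027'}
```

An expected result with no concrete values yields an empty set here, which is a useful prompt to tighten the scenario row. Exact string matching is brittle for voice: a transcript may render a year or price in words, so a real check would need normalization or a model-based comparison.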
Guardrails
Does the agent resist jailbreaks, avoid recommending competitors, and stay on topic under adversarial pressure?
Use adversarial driver personas that push boundaries: callers who try to get the agent to break character, reveal system prompts, or make off-policy statements. Pair them with model_refusal (did the agent refuse the request it should refuse?) and toxicity (did it stay civil under provocation?). See Red-Teaming for patterns on building adversarial drivers and the checks to pair with them.
Bias Detection
Does performance degrade for certain accents, speaking styles, or demographic personas? Voice benchmarking research confirms that every model tested shows a pass-rate drop between clean and realistic audio conditions. The drop is not uniform. Some accents and speaking patterns are affected more than others.
Surface this by running the same scenario set across diverse persona and augmentation combinations. Compare result_completed rates across personas to identify where your agent underperforms. The fairness check is a complementary signal: it scores the agent's language for bias (1-5). It measures biased language in the output, not performance parity — the persona comparison above is what surfaces parity gaps.
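The persona comparison can be sketched in a few lines of plain Python. The example assumes you have exported one record per simulated conversation with a persona label and a boolean result_completed outcome. The record shape, the function names, and the 10-point default threshold are hypothetical choices for this sketch.

```python
from collections import defaultdict


def pass_rates_by_persona(results: list[dict]) -> dict[str, float]:
    """Compute the result_completed pass rate for each persona."""
    counts = defaultdict(lambda: [0, 0])  # persona -> [passed, total]
    for row in results:
        tally = counts[row["persona"]]
        tally[0] += int(row["result_completed"])
        tally[1] += 1
    return {persona: passed / total
            for persona, (passed, total) in counts.items()}


def parity_gaps(rates: dict[str, float], max_gap: float = 0.10) -> dict[str, float]:
    """Return personas whose pass rate trails the best persona by more than max_gap."""
    best = max(rates.values())
    return {persona: round(best - rate, 4)
            for persona, rate in rates.items()
            if best - rate > max_gap}


# Hypothetical outcomes: the same four scenarios run under three personas.
results = (
    [{"persona": "baseline", "result_completed": True}] * 4
    + [{"persona": "fast_talker", "result_completed": True}] * 3
    + [{"persona": "fast_talker", "result_completed": False}] * 1
    + [{"persona": "noisy_line", "result_completed": True}] * 2
    + [{"persona": "noisy_line", "result_completed": False}] * 2
)

rates = pass_rates_by_persona(results)
print(rates)               # {'baseline': 1.0, 'fast_talker': 0.75, 'noisy_line': 0.5}
print(parity_gaps(rates))  # {'fast_talker': 0.25, 'noisy_line': 0.5}
```

Comparing each persona against the best-performing one, not against the average, keeps a single weak persona from hiding inside the mean. With only a handful of conversations per persona the rates are noisy, so treat small gaps as a reason to run more rows before drawing conclusions.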
Agent Trace Capture
Everything above scores the conversation from the outside: what the caller heard. Simulation also runs a second evaluation loop underneath it. While the conversation is in progress, your agent's inner traces are captured live, and each trace is scored with its own checks.
If your agent emits OpenTelemetry traces, every LLM invocation, tool call, and retrieval step it makes during the simulated conversation is captured as it happens and scored with trace-level checks:
- function_call_validator: did the agent call the right function?
- function_parameter_accuracy: were the parameters it passed correct for the conversation context?
- function_result_present: did the function's result actually make it into the agent's response?
- transcript_fidelity: does the platform-provided transcript match the actual call audio?
- Latency, token, and cost metrics: recorded automatically on every captured trace.
This gives every simulated conversation two evaluation layers. Transcript-level checks score what the caller experienced; trace-level checks score what the agent actually did internally to produce it. Trace scores aggregate per conversation and across the run, alongside the transcript checks.
Setup is one-time instrumentation. See Sending Traces via OpenTelemetry for connecting your agent's traces to your simulations.
Backend Integration Testing
Payments, order lookup, CRM, warranty systems: does the agent call the right backend and interpret the response correctly?
The canonical integrated failure mode: a misheard account number causes an API call to fail, and the agent never recovers. This only surfaces when speech recognition, conversation logic, and backend calls are tested together in a live conversation.
Design scenarios that exercise each integration path:
```python
import os

from okareo import Okareo
from okareo_api_client.models import ScenarioSetCreate

# Assumes the API key is exported as OKAREO_API_KEY.
okareo = Okareo(os.environ["OKAREO_API_KEY"])

# One row per backend: warranty system, order lookup, payments.
scenario = okareo.create_scenario_set(ScenarioSetCreate(
    name="Integration Paths",
    seed_data=okareo.seed_data_from_list([
        {"input": "Check the warranty status for VIN 1HGCM82633A004352. "
                  "Confirm whether drivetrain coverage is still active.",
         "result": "Agent confirms drivetrain warranty is active until 2027"},
        {"input": "Look up order #9923. Confirm the shipping status and ETA.",
         "result": "Agent provides shipping status with estimated delivery date"},
        {"input": "Process a refund for invoice INV-4410, amount $129.00. "
                  "Confirm the refund timeline.",
         "result": "Agent initiates refund and confirms 5-7 business day timeline"},
    ]),
))
```
Each row targets a different backend system. The expected result includes specific values the agent must return from the real integration: verifiable outputs, not generic language.
This transcript-level approach verifies what the agent reported. Agent trace capture verifies what it did: function_call_validator confirms the right backend was called, function_parameter_accuracy confirms it was called with the right arguments (the misheard account number shows up here, in the actual API call), and function_result_present confirms the response made it back into the conversation. Use both layers together — a passing transcript with a failing trace check means the agent got the right answer the wrong way.
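The two-layer idea can be sketched as a comparison between the tool call a scenario implies and the one the agent actually made, combined with the transcript verdict. Everything here is a hypothetical stand-in: the ToolCall shape, the function names, and the verdict labels are assumptions for this example and do not reflect Okareo's trace schema or how its trace-level checks are implemented.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """Hypothetical, simplified view of one captured backend call."""
    name: str
    arguments: dict = field(default_factory=dict)


def call_mismatches(expected: ToolCall, actual: ToolCall) -> dict:
    """Return {field: (expected, actual)} for everything that differs."""
    if expected.name != actual.name:
        return {"function": (expected.name, actual.name)}
    keys = expected.arguments.keys() | actual.arguments.keys()
    return {key: (expected.arguments.get(key), actual.arguments.get(key))
            for key in keys
            if expected.arguments.get(key) != actual.arguments.get(key)}


def combined_verdict(transcript_passed: bool, mismatches: dict) -> str:
    """Combine the transcript-level and trace-level outcomes into one label."""
    if transcript_passed and not mismatches:
        return "pass"
    if transcript_passed:
        return "right answer, wrong path"
    if not mismatches:
        return "correct call, wrong answer"
    return "fail"


# The caller said order 9923; speech recognition heard 9932.
expected = ToolCall("lookup_order", {"order_id": "9923"})
actual = ToolCall("lookup_order", {"order_id": "9932"})

mismatches = call_mismatches(expected, actual)
print(mismatches)                          # {'order_id': ('9923', '9932')}
print(combined_verdict(True, mismatches))  # right answer, wrong path
```

Keeping the two verdicts separate until the final step preserves the diagnostic value of each layer: the label says whether to look at the agent's wording, its backend call, or both.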
Checks That Matter Here
Okareo includes 60+ predefined checks. The full catalog is at Checks. The most relevant for agent validation:
- result_completed: Did the agent fulfill the caller's objective? The workhorse pass/fail signal.
- response_consistency: Were the agent's responses coherent and non-contradictory across turns? Catches drift and slot confusion.
- response_loop: Did the agent get stuck repeating itself?
- automated_resolution: Did the agent resolve without escalating? Relevant for measuring self-service rate.
- Trace-level checks (function_call_validator, function_parameter_accuracy, function_result_present, transcript_fidelity): Evaluated on the agent's captured inner traces rather than the transcript. See Agent Trace Capture.
For domain-specific validation (price accuracy, policy language, regulated disclosures), build custom checks.
Where to Go Next
- Handoff Testing: when the agent should (or shouldn't) escalate to a human.
- Personas and Scenarios: building the diverse caller personas that exercise these validation categories.
- Voice Augmentation: testing the same scenarios under realistic audio conditions (noise, interruptions, accent diversity).
- Experimentation and A/B Testing: comparing scores across agent versions and gating deploys.