AI-to-Human Handoff Testing

The handoff decision is the highest-stakes judgment a voice agent makes. Getting it wrong in either direction costs real money:

Premature escalation. The agent gives up when it could have resolved the issue. Every unnecessary transfer wastes human agent minutes and degrades the caller's experience with hold times and repeated context.
Missed escalation. The agent should have handed off but didn't. The caller gets stuck in a loop, hangs up frustrated, and calls back, generating a more expensive interaction and eroding trust.

Simulation lets you probe this boundary from both sides before real callers hit it. Everything described here uses the same simulation form in the Okareo app or SDK that you use for any other voice simulation.

The Two Failure Modes

Unnecessary Escalation

The agent transfers the call even though it had the information and authority to resolve the issue. This inflates human agent workload and signals a lack of confidence in the AI layer.

How to detect it: The automated_resolution check evaluates whether the agent resolved the issue without escalating. If a scenario describes a problem the agent should be able to handle, and automated_resolution fails (returns 0), the agent escalated unnecessarily.

Missing Escalation

The agent tries to handle something beyond its scope: a billing dispute requiring manager approval, a safety concern, or a request it simply can't fulfill. Instead of handing off, it stalls, loops, or gives a wrong answer.

How to detect it: The result_completed check evaluates whether the conversation achieved its goal. When the scenario's expected outcome explicitly requires escalation (e.g., "Agent transfers caller to billing specialist"), a failing result_completed means the agent didn't hand off when it should have.

Check Pairing

The power of handoff testing comes from reading automated_resolution and result_completed together:

automated_resolution	result_completed	What it means
Pass (1)	Pass (1)	Agent resolved correctly without escalating. Ideal for self-service scenarios.
Fail (0)	Pass (1)	Agent escalated, and the scenario's goal was still met. Correct handoff for scenarios that require it.
Fail (0)	Fail (0)	Agent escalated, but the goal wasn't achieved. Broken handoff: wrong queue, lost context, or premature transfer.
Pass (1)	Fail (0)	Agent didn't escalate, but also didn't solve the problem. Missed escalation, the worst outcome.

The last row (automated_resolution passes, result_completed fails) is the one that burns customer trust. Your scenarios should be designed to detect it.
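The pairing table can be encoded as a small lookup so that each conversation in a run gets a label. This is a minimal sketch, not part of the Okareo SDK: the function name classify_handoff and the label strings are hypothetical, and it assumes both checks report 1 for pass and 0 for fail, as in the table.

```python
# Hypothetical helper (not an Okareo API): maps the pair
# (automated_resolution, result_completed) to one of the four table outcomes.
OUTCOMES = {
    (1, 1): "resolved_without_escalation",
    (0, 1): "correct_handoff",
    (0, 0): "broken_handoff",
    (1, 0): "missed_escalation",
}


def classify_handoff(automated_resolution, result_completed):
    """Return the outcome label for one conversation's pair of check scores."""
    if automated_resolution is None or result_completed is None:
        # A check that did not run should not be silently read as a failure.
        return "missing_score"
    key = (int(bool(automated_resolution)), int(bool(result_completed)))
    return OUTCOMES[key]


if __name__ == "__main__":
    for pair in [(1, 1), (0, 1), (0, 0), (1, 0)]:
        print(pair, classify_handoff(*pair))
```

Whether a given label is good or bad still depends on the scenario: correct_handoff is the target for should-escalate rows, but it counts as an unnecessary escalation on a row the agent was expected to resolve itself.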

Checks to Add

The core pair tells you whether the handoff decision was right. Two supporting checks tell you how it failed and how it felt to the caller:

response_loop: detects the agent repeating the same content without adding new information or taking action. This is the signature symptom of a missed escalation — the agent that should have handed off stalls and loops instead. When a row lands in the automated_resolution=1 / result_completed=0 cell, a failing response_loop confirms the agent was spinning rather than making progress.
empathy_score: an audio-native check. The judge listens to the actual call recording and scores the agent's tone and verbiage from 1 to 5, flagging moments where caller frustration or distress went unacknowledged. For escalation scenarios like the fraud dispute and medical emergency below, a correct handoff delivered in a flat, robotic tone still erodes trust — this check catches it.

Add both alongside automated_resolution and result_completed in the same simulation form or SDK checks list.
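One way to read the two supporting checks next to the core pair is sketched below. The assumptions are loud ones: the function supporting_notes is hypothetical, it assumes per-row scores arrive as a dict keyed by check name, that response_loop reports 0 when it fails, and that an empathy_score below 3 is worth flagging. The threshold of 3 is an illustrative choice, not an Okareo default.

```python
def supporting_notes(row_scores, empathy_floor=3):
    """Hypothetical triage helper: explain how a row failed and how it felt.

    row_scores is assumed to be a dict keyed by check name, for example
    {"automated_resolution": 1, "result_completed": 0,
     "response_loop": 0, "empathy_score": 2}.
    """
    notes = []

    # Missed-escalation case: the agent did not hand off and did not succeed.
    missed = (
        row_scores.get("automated_resolution") == 1
        and row_scores.get("result_completed") == 0
    )
    if missed and row_scores.get("response_loop") == 0:
        notes.append("missed escalation, agent was looping")

    # empathy_score is on a 1 to 5 scale; the floor is an assumed cutoff.
    empathy = row_scores.get("empathy_score")
    if empathy is not None and empathy < empathy_floor:
        notes.append(f"low empathy ({empathy}/5)")

    return notes


if __name__ == "__main__":
    sample = {
        "automated_resolution": 1,
        "result_completed": 0,
        "response_loop": 0,
        "empathy_score": 2,
    }
    print(supporting_notes(sample))
```

Keeping the empathy cutoff as a parameter lets escalation scenarios, where tone matters most, use a stricter floor than routine self-service rows.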

Designing Scenarios That Probe the Boundary

Include both "resolvable" and "should-escalate" rows in the same scenario set. This tests the agent's ability to make the right call in both directions:

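import os

from okareo import Okareo
from okareo_api_client.models import ScenarioSetCreate

# Assumes the API key is available in the OKAREO_API_KEY environment variable.
okareo = Okareo(os.environ["OKAREO_API_KEY"])

scenario = okareo.create_scenario_set(ScenarioSetCreate(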
    name="Handoff Boundary",
    seed_data=okareo.seed_data_from_list([
        # Resolvable: agent should handle without escalating
        {"input": "Check the status of order #7733. Confirm the delivery date.",
         "result": "Agent provides order status and estimated delivery date"},
        {"input": "Update the email address on account ending in 4821.",
         "result": "Agent updates email and confirms the change"},

        # Should escalate: agent cannot or should not resolve alone
        {"input": "Dispute a $340 charge from three months ago. "
                  "You believe it's fraud and want it reversed immediately.",
         "result": "Agent transfers caller to fraud investigation team with case context"},
        {"input": "You're having a medical emergency related to a product. "
                  "You need to speak to someone immediately.",
         "result": "Agent immediately escalates to emergency response team"},
    ]),
))

The result field is what result_completed evaluates against. For escalation scenarios, describe the expected handoff outcome: not just "agent escalates" but what should happen (right team, context preserved, urgency recognized).

Reading the Results

In the App

Open the completed run. The per-conversation table shows automated_resolution and result_completed for every call; click Detail on a row to see the scores next to the transcript. Use the check pairing table above to interpret each row: resolvable scenarios should pass both checks; escalation scenarios should show automated_resolution=0 with result_completed=1.

From the SDK

After the run completes, look at the check scores across your scenario rows:

# `result` is the object returned by the completed simulation run.
scores = result.model_metrics.to_dict()
rows = scores.get("scores_by_row", [])

for i, row_scores in enumerate(rows):
    auto_res = row_scores.get("automated_resolution")
    completed = row_scores.get("result_completed")
    print(f"Row {i}: automated_resolution={auto_res}, result_completed={completed}")

What to look for:

Resolvable rows should show automated_resolution=1 and result_completed=1.
Escalation rows should show automated_resolution=0 (agent handed off) and result_completed=1 (handoff achieved the goal).
Any row with automated_resolution=1 and result_completed=0 is a missed escalation. The agent tried to handle it alone and failed.
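The three rules above can be applied mechanically once you record which scenario rows were meant to escalate. The sketch below assumes per-row score dicts shaped like the loop above produces, plus a parallel should_escalate list that you maintain yourself. Both audit_rows and should_escalate are hypothetical names, not part of the SDK.

```python
# Expected (automated_resolution, result_completed) per row type.
EXPECTED = {
    False: (1, 1),  # resolvable row: agent handles it alone
    True: (0, 1),   # escalation row: agent hands off and the goal is met
}


def audit_rows(rows, should_escalate):
    """Return (row_index, finding, actual_scores) for rows that miss expectations."""
    findings = []
    for i, (scores, escalate) in enumerate(zip(rows, should_escalate)):
        actual = (scores.get("automated_resolution"), scores.get("result_completed"))
        if actual == EXPECTED[escalate]:
            continue
        if actual == (1, 0):
            finding = "missed escalation"
        elif not escalate and actual[0] == 0:
            finding = "unnecessary escalation"
        elif escalate and actual == (0, 0):
            finding = "broken handoff"
        else:
            finding = "unexpected scores"
        findings.append((i, finding, actual))
    return findings


if __name__ == "__main__":
    # Illustrative scores only, ordered like the four scenario rows:
    # two resolvable rows followed by two escalation rows.
    rows = [
        {"automated_resolution": 1, "result_completed": 1},
        {"automated_resolution": 0, "result_completed": 1},
        {"automated_resolution": 0, "result_completed": 1},
        {"automated_resolution": 1, "result_completed": 0},
    ]
    for finding in audit_rows(rows, [False, False, True, True]):
        print(finding)
```

The should_escalate list has to stay in the same order as the seed data. If the scenario set is reordered, the flags must be reordered with it.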

Catching Handoff Regressions

A prompt change or model upgrade can shift the escalation boundary without anyone noticing. The agent might become more aggressive (escalating less) or more cautious (escalating more). Both shifts have business impact.

Use Experimentation and A/B Testing to track the automated_resolution rate across versions. A sudden shift in either direction warrants investigation: even if result_completed stays stable, a change in escalation rate changes your staffing economics.
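If you export per-row scores from two runs, the rate comparison itself is simple arithmetic. The following is a sketch under stated assumptions: rows are dicts with an automated_resolution key holding 1 or 0, the helper names are hypothetical, and the 0.10 tolerance is an arbitrary illustrative threshold rather than a recommended value.

```python
def automated_resolution_rate(rows):
    """Fraction of scored rows where the agent resolved without escalating."""
    scored = [
        r["automated_resolution"]
        for r in rows
        if r.get("automated_resolution") is not None
    ]
    if not scored:
        raise ValueError("no automated_resolution scores to aggregate")
    return sum(scored) / len(scored)


def rate_shift(baseline_rows, candidate_rows, tolerance=0.10):
    """Compare two runs and flag a shift in either direction beyond tolerance."""
    baseline = automated_resolution_rate(baseline_rows)
    candidate = automated_resolution_rate(candidate_rows)
    delta = candidate - baseline
    return {
        "baseline": baseline,
        "candidate": candidate,
        "delta": delta,  # positive: escalating less; negative: escalating more
        "flag": abs(delta) > tolerance,
    }


if __name__ == "__main__":
    # Illustrative scores only, not real run data.
    baseline = [{"automated_resolution": s} for s in (1, 1, 0, 0)]
    candidate = [{"automated_resolution": s} for s in (1, 1, 1, 0)]
    print(rate_shift(baseline, candidate))
```

The flag uses the absolute delta because both directions matter: a positive shift means the agent is escalating less, a negative one means it is escalating more.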

For nightly regression patterns, see Scheduling Simulations.

Where to Go Next

AI Agent & Integration Testing: the broader validation framework that handoff testing fits into.
Voice Checks: understanding how checks are scored and aggregated.
Experimentation and A/B Testing: tracking handoff rates across agent versions.
