Load Testing

Load testing isn't just about the AI model. It stress-tests the entire stack: network, routing engine, AI agent, CRM integrations, and agent desktop. Latency that looks acceptable at 1 concurrent call often becomes a trust-eroding 3+ second gap at 20.

Drive concurrent calls at your voice agent and measure how latency, success rate, and infrastructure behave under volume. Voice load tests catch failure modes that single-call regression tests cannot: queue saturation, provider rate limits, agent backend timeouts, and degradation patterns that only emerge above a certain number of concurrent sessions.

What Load Testing Catches

| Failure mode | Symptom in results |
| --- | --- |
| Agent backend saturation | avg_turn_taking_latency p90 climbs sharply above baseline |
| Telephony / TTS rate limits | Conversations fail to start; result_completed rate drops |
| Memory or queue leaks under sustained load | Latency degrades over the course of the run, not at start |
| Cascading timeouts | response_loop flips for some conversations as agent retries fragment |
| Routing engine bottlenecks | Tail latency (p95, p99) diverges from p50 |

The Setup

Two knobs control load:

  • Max Parallel Requests on the voice target: caps concurrent calls hitting your agent.
  • Total volume = scenario rows x repeats.
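
As a rough illustration of how the two knobs interact, the sketch below computes total call volume and the smallest number of full-concurrency "waves" needed to work through it. `plan_load` is a hypothetical helper, not part of the Okareo SDK, and the wave count assumes every call takes about the same time, which real calls will not.

```python
import math
from typing import Optional


def plan_load(scenario_rows: int, repeats: int, max_parallel: Optional[int]) -> dict:
    """Estimate call volume for a load run (hypothetical helper, not an SDK call)."""
    total_calls = scenario_rows * repeats
    # An empty Max Parallel Requests setting means no cap, so peak concurrency
    # is bounded only by the total number of calls.
    peak = total_calls if max_parallel is None else min(max_parallel, total_calls)
    # Lower bound on sequential batches if every call had the same duration.
    waves = math.ceil(total_calls / peak) if peak else 0
    return {"total_calls": total_calls, "peak_concurrency": peak, "min_waves": waves}


print(plan_load(scenario_rows=5, repeats=2, max_parallel=10))
print(plan_load(scenario_rows=5, repeats=8, max_parallel=10))
```

With 5 rows and 2 repeats, all 10 calls fit inside a cap of 10. Raising repeats to 8 keeps peak concurrency at 10 but extends the run, which is the knob to turn when you want sustained rather than peak load.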

In the App

  1. Set concurrency on the target. Go to Targets, create or edit your voice target, and set Max Parallel Requests to the desired concurrency (e.g. 10, 20, 50). Leave it empty for unlimited concurrency.
  2. Build the scenario. Create a scenario with representative test cases. More rows means more diversity across concurrent calls.
  3. Configure repeats. In the simulation form under Advanced Settings, set Repeats to multiply the total call volume. For example, 5 scenario rows with 2 repeats produces 10 total calls.
  4. Run and inspect. After the run completes, open the results to see mean / p50 / p90 latency with distribution charts, and pass rates for each check.
[Figure: Latency check card showing mean, p50, and p90 with the distribution across conversations]
[Figure: Load test run summary with latency percentiles, pass rate, and the per-conversation table]

From the SDK

The same setup is available programmatically:

import os
from okareo import Okareo
from okareo.model_under_test import PhoneTarget, Target
from okareo_api_client.models import ScenarioSetCreate

okareo = Okareo(os.environ["OKAREO_API_KEY"])

# Driver prompt describing the simulated caller.
driver = okareo.generate_driver_prompt("Customer calling support with a routine account question")

# Each row is one test case; more rows give more variety across concurrent calls.
scenario = okareo.create_scenario_set(ScenarioSetCreate(
    name="Voice Load Test",
    seed_data=okareo.seed_data_from_list([
        {"input": "What's your account balance?", "result": "Agent provides balance"},
        {"input": "When does your subscription renew?", "result": "Agent provides renewal date"},
        {"input": "Get a copy of your last invoice.", "result": "Agent sends invoice"},
        {"input": "Is there a fee to upgrade your plan?", "result": "Agent explains upgrade costs"},
        {"input": "How do I add a second user?", "result": "Agent explains multi-user setup"},
    ]),
))

result = okareo.run_simulation(
    name="Load Test - Voice Quality",
    target=Target(
        name="My Voice Agent",
        # max_parallel_requests caps the concurrent calls hitting the agent.
        target=PhoneTarget(phone_number="+1XXXXXXXXXX", max_parallel_requests=10),
    ),
    scenario=scenario,
    driver=driver,
    max_turns=3,
    repeats=2,  # 5 scenario rows x 2 repeats = 10 total calls, all concurrent under a cap of 10
    checks=["avg_turn_taking_latency", "result_completed"],
)

Scaling concurrency

Default plans cap concurrency at a low level for safety. Okareo scales to hundreds or thousands of concurrent calls. Reach out to configure higher concurrency for your plan.

Reading Percentile Scores

Latency-style checks (avg_turn_taking_latency, avg_words_per_minute) get server-computed percentile scores in addition to means.

In the App

The run detail page shows mean, p50, and p90 values on latency score cards, with a distribution chart of per-conversation values. The per-conversation table below lists each call's latency; sort by the check column to identify which conversations had the worst latency, and click Detail to inspect them.

From the SDK

metrics = result.model_metrics.to_dict()
scores = metrics["mean_scores"]
percentiles = metrics.get("percentile_scores", {})

latency_pct = percentiles.get("avg_turn_taking_latency", {})

print(f" Mean latency: {scores.get('avg_turn_taking_latency')} ms")
print(f" p50: {latency_pct.get('p50')} ms")
print(f" p90: {latency_pct.get('p90')} ms")
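
One way to use these values in an automated run is to compare them against a latency budget. The sketch below assumes only the dictionary shape shown above (`mean_scores` and `percentile_scores` keyed by check name). `latency_budget_violations`, the budget numbers, and the example metrics are all hypothetical and are not part of the Okareo SDK.

```python
def latency_budget_violations(metrics: dict, check: str, budgets_ms: dict) -> list:
    """Return human-readable violations; an empty list means the run is within budget."""
    # Merge the mean and the percentile values for this check into one lookup.
    stats = {"mean": metrics.get("mean_scores", {}).get(check)}
    stats.update(metrics.get("percentile_scores", {}).get(check, {}))

    violations = []
    for stat, limit in budgets_ms.items():
        value = stats.get(stat)
        if value is None:
            # Treat a missing statistic as a failure rather than silently passing.
            violations.append(f"{check} {stat}: missing from results")
        elif value > limit:
            violations.append(f"{check} {stat}: {value} ms exceeds {limit} ms")
    return violations


# Illustrative values only, shaped like result.model_metrics.to_dict().
example_metrics = {
    "mean_scores": {"avg_turn_taking_latency": 900.0},
    "percentile_scores": {"avg_turn_taking_latency": {"p50": 800.0, "p90": 1900.0}},
}
print(latency_budget_violations(
    example_metrics, "avg_turn_taking_latency", {"p50": 1000, "p90": 1500}
))
```

Budgeting p90 separately from p50 keeps a slow tail from hiding behind a healthy median.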

| Aggregation level | What it represents |
| --- | --- |
| Per-turn raw sample | Each turn's latency between caller-stops and agent-replies |
| Per-conversation avg_turn_taking_latency | Average across the call's turns (in scores_by_row) |
| Run-level p50 / p90 | Percentile across the run's per-conversation averages |

This is why p50/p90 from a load run is meaningful: it tells you what fraction of conversations (not turns) had degraded latency, which is the user-visible failure unit.
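
The three aggregation levels can be reproduced locally on made-up numbers. This is a sketch only: the latencies are invented, and the nearest-rank percentile used here is an assumption, since the method the server uses to compute percentiles is not stated in this guide.

```python
import math
from statistics import mean


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Level 1: made-up per-turn latencies in ms, keyed by conversation.
turn_latencies_ms = {
    "call-1": [600, 700, 800],
    "call-2": [650, 750],
    "call-3": [700, 900, 800],
    "call-4": [2500, 3100, 2800],  # one degraded conversation
    "call-5": [720, 680],
}

# Level 2: one average per conversation.
per_conversation = {cid: mean(turns) for cid, turns in turn_latencies_ms.items()}

# Level 3: percentiles across conversations, not across turns.
run_values = list(per_conversation.values())
p50 = nearest_rank_percentile(run_values, 50)
p90 = nearest_rank_percentile(run_values, 90)

print(per_conversation)
print(f"p50={p50} ms, p90={p90} ms")
```

Here four of five conversations average 700 to 800 ms, so p50 stays low, while the single degraded call sets p90. Because each conversation counts once regardless of how many turns it has, the degraded call carries the same weight as any other call.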

For the full set of checks that get percentile aggregation, see Voice Checks.

Interpreting Results

Run a baseline at low concurrency first. Then run progressively higher loads and watch:

| Pattern | Meaning |
| --- | --- |
| Mean and p50 stable, p90 climbs | Tail-latency issue. A subset of calls is hitting a slow path. |
| Mean climbs proportionally to concurrency | Backend saturation. Compute or queue depth is the bottleneck. |
| result_completed drops at high concurrency | Conversations are failing partway through, not just slowing. |
| Pattern cleaner at low max_parallel_requests | Your agent has a per-second limit lower than you thought. |
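
The comparison against a baseline can be roughed out in code. The sketch below is a heuristic under stated assumptions: `classify_load_pattern` is a hypothetical helper, the 20% tolerance is an arbitrary threshold, and "mean and p50 both rise" is used as a simplified stand-in for "mean climbs proportionally to concurrency". The input numbers are illustrative.

```python
def classify_load_pattern(baseline: dict, loaded: dict, tolerance: float = 0.2) -> str:
    """Map a baseline/loaded pair onto the patterns above (heuristic sketch).

    Each dict holds mean, p50, p90 (latency in ms) and completed_rate (0..1).
    `tolerance` is an assumed relative change below which a metric counts as stable.
    """
    def grew(key: str) -> bool:
        return loaded[key] > baseline[key] * (1 + tolerance)

    # Failures take priority: latency numbers mean little if calls do not complete.
    if loaded["completed_rate"] < baseline["completed_rate"] * (1 - tolerance):
        return "conversations failing partway through"
    # The whole distribution shifted, not just the tail.
    if grew("mean") and grew("p50"):
        return "backend saturation"
    # Median held steady but the slowest conversations got slower.
    if grew("p90"):
        return "tail-latency issue"
    return "stable"


baseline = {"mean": 800, "p50": 780, "p90": 950, "completed_rate": 1.0}
loaded = {"mean": 850, "p50": 790, "p90": 2400, "completed_rate": 0.98}
print(classify_load_pattern(baseline, loaded))
```

Running the same comparison at each concurrency step makes it easier to see the level at which the classification first changes.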

Where to Go Next

Cookbook

Full runnable script: 08_load_test.py