Skip to main content

Voice Checks

Checks score what happened in the conversation. For voice runs, three categories work together: code checks (deterministic Python), model checks (LLM judge on the text transcript), and audio checks (multimodal LLM judge on the actual audio files).

This page covers the predefined voice checks, when to combine them, how results are structured, and where to write your own.

The Three Check Categories

CategoryWhat it doesVoice-relevant examplesWhen to use
CodeDeterministic Python running against transcript metadataavg_turn_taking_latency, avg_words_per_minute, total_turn_count, mu_utterance_error_rateTiming metrics, counting, structural assertions
ModelPrompt evaluated by a judge LLM against the text transcriptresult_completed, response_consistency, response_loop, automated_resolution, behavior_adherenceTask completion, coherence, policy compliance, escalation decisions
AudioMultimodal LLM judge evaluating the actual audio recordingempathy_score, numeric_verbalization, transcript_fidelityTone of voice, spoken accuracy, transcription quality

Audio checks only run on the final turn of a voice simulation because they require extracted audio files and are too expensive for every-turn evaluation. Model and code checks run after every turn.

Selecting Checks for a Run

In the App

In the simulation form, the Checks section has a searchable selector covering Okareo's published checks and your custom checks. You can combine code, model, and audio checks in a single run. The same checks are available whether you run from the app or the SDK.

Checks selector in the simulation form showing Okareo published checksChecks selector in the simulation form showing Okareo published checks

From the SDK

Pass check names in the checks parameter. A single run can apply code, model, and audio checks together:

CHECKS = [
"avg_turn_taking_latency", # code (timing from metadata)
"result_completed", # model (LLM judge on transcript)
"response_consistency", # model
"total_turn_count", # code
"response_loop", # model
"empathy_score", # audio (multimodal judge on recording)
]

result = okareo.run_simulation(
name="Voice Checks Demo",
target=target,
scenario=scenario,
driver=driver,
max_turns=5,
checks=CHECKS,
calculate_metrics=True,
)

Common Starting Checks for Voice Runs

Okareo includes 60+ predefined checks across code, model, and audio categories. The full catalog is at Checks. The checks below are the most common starting point for voice simulations:

Code checks (deterministic, from metadata):

  • avg_turn_taking_latency: Average time (ms) between the caller finishing and the agent starting to reply. The primary responsiveness metric for voice.
  • avg_words_per_minute: Average speaking rate of the voice target. Useful for detecting rushed or sluggish delivery.
  • total_turn_count: How many user→assistant turn pairs the conversation took.

Model checks (LLM judge on transcript):

  • result_completed: Did the agent fulfill the caller's objective? Compares the full conversation against the scenario's expected result. The workhorse pass/fail signal.
  • response_consistency: Were the agent's responses coherent and non-contradictory across turns? Strict on entities, IDs, numbers, and dates.
  • response_loop: Did the agent get stuck repeating itself without progressing?
  • automated_resolution: Did the agent resolve without escalating to a third party? Key for handoff testing.

Audio checks (multimodal judge on audio files, final-turn only):

  • empathy_score: Does the agent's voice tone express empathy? Scored 1 to 5.
  • numeric_verbalization: Were phone numbers, prices, dates, and alphanumeric sequences spoken correctly?
  • transcript_fidelity: How accurately does the platform transcript match the actual audio recording? Scored 1 to 5.

For domain-specific assertions, build custom checks. Python code checks for deterministic rules, or LLM model checks for subjective evaluation.

Reading Results

In the App

The run detail page shows a score summary card for each check you selected. Pass/fail checks display the pass rate with a progress ring. Latency-style checks show mean, p50, and p90 values with a distribution chart. Below the cards, the per-conversation table lists each check's score per call; click Detail on a row to see individual scores alongside the transcript and audio for that conversation.

Score summary cards on the run detail page: latency card with mean, p50, and p90 plus pass-rate rings for boolean checksScore summary cards on the run detail page: latency card with mean, p50, and p90 plus pass-rate rings for boolean checks

From the SDK

result.model_metrics.to_dict() returns three useful keys:

KeyWhat it contains
mean_scoresMean of each check across all conversations in the run
scores_by_rowPer-conversation score for each check
percentile_scoresp50 / p90 / p95 / p99 for latency-style checks
metrics = result.model_metrics.to_dict()
mean = metrics["mean_scores"]
percentiles = metrics.get("percentile_scores", {})

latency_pct = percentiles.get("avg_turn_taking_latency", {})
print(f"p50: {latency_pct.get('p50')} ms, p90: {latency_pct.get('p90')} ms")

Percentile Aggregation

Percentile scores (p50, p90, p95, p99) are computed for the checks in PERCENTILE_CHECK_NAMES: latency, avg_turn_latency, avg_turn_taking_latency, and avg_words_per_minute. Other numeric checks report mean scores but do not get percentile breakdowns.

Two-level latency aggregation: per-turn raw samples are averaged into one value per conversation, then p50/p90/p95/p99 percentiles are computed across all per-conversation averages in the runTwo-level latency aggregation: per-turn raw samples are averaged into one value per conversation, then p50/p90/p95/p99 percentiles are computed across all per-conversation averages in the run
  1. Each turn produces a raw sample (e.g. the latency between the caller's last syllable and the agent's first).
  2. Per conversation, those samples are averaged into one number (the value you see in scores_by_row).
  3. Across all conversations in the run, percentiles are computed from those per-conversation averages.

This conversation-level distribution is the right metric for load testing because it is not skewed by chatty or quiet individual calls.

Custom Checks

The predefined list covers most voice testing patterns. When you need something specific (domain-bound assertions, regex matches, branded content checks), build a custom check.

  • Code Checks: Python with an evaluate(metadata) method. Deterministic, fast, runs on each row.
  • Model Checks: Prompt template evaluated by a judge LLM. Use for subjective scoring.
  • Custom Checks: How to register, version, and reuse your own checks across runs.
  • Checks Introduction: Full predefined check catalog and category framing.

Checks are versioned: every time you modify a check, Okareo persists a new version, and runs reference the version that was active when they ran. Re-evaluating an old run with a newer version of a check is the Re-Scoring flow.

Where to Go Next

Cookbook

Full runnable script: 03_checks.py