Re-Scoring Past Runs

Phone calls cost money. Re-scoring lets you try different checks against an existing run without placing a single new call. The transcripts and recordings are already on Okareo's side. Only the evaluation step runs again.

When to Use This

Tuning thresholds. You want to see what the run's result_completed rate would be with a stricter judge prompt.
Comparing check definitions. You shipped a new version of a custom check; re-score last week's runs to validate it.
Retroactive analysis. A new check (e.g. empathy_score) didn't exist when the run happened. Add it now without placing new calls.
Cost discipline. Iterate on quality criteria during development without burning real call minutes.

How to Re-Score

In the App

Open any finished voice simulation run.
Click the Re-score button.
In the Re-score run modal, select the checks you want to apply. The selector groups Okareo's published checks and your custom checks; mix them freely.
Adjust the Re-score name if you want, then click Re-score.
The new run appears in your simulations list, tagged with a Source link back to the original. Open both to compare check results on the same conversations.

Run header actions on a finished run, including the Re-score button

Re-score run modal with check selector and run name field

From the SDK

okareo.re_evaluate(...) accepts a list of checks and an optional new name. It returns a new test run with the same conversations and the new scores.

import os
from okareo import Okareo

okareo = Okareo(os.environ["OKAREO_API_KEY"])

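# 1. Find the source run you want to re-evaluate
all_runs = okareo.find_test_runs(name="Multi-Scenario Voice Sim")
# Pick the first finished run; fail with a clear message if there is none
source = next((r for r in all_runs if r["status"] == "FINISHED"), None)
if source is None:
    raise RuntimeError("No finished run named 'Multi-Scenario Voice Sim' was found")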

# 2. Re-evaluate with new checks
new_run = okareo.re_evaluate(
    test_run_id=source["id"],
    checks=["response_consistency", "total_turn_count"],
    name="Rescore - New Checks",
)
print(f"New run: {new_run.id}")
print(f"Results: {new_run.app_link}")

Argument	Notes
`test_run_id`	ID of the source run to re-evaluate.
`checks`	List of check names or check UUIDs. Mix of predefined and custom checks works.
`name`	Optional. Name for the new re-evaluated run; helps distinguish it in the run list.

What Gets Reused vs Recomputed

Reused (from the source run)	Recomputed (in the new run)
Phone calls (no new dialing)	Per-turn check scores
Transcripts	Per-row aggregate scores
Audio recordings	Run-level mean / percentile scores
Driver and Target configurations	Per-check pass/fail status
Scenario rows and metadata	App link / run ID

The new run is a sibling of the source: same conversations, fresh evaluation pass.
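Because the sibling run covers the same conversations, run-level scores can be compared check by check. The sketch below assumes you have already collected each run's run-level mean scores into a plain dict keyed by check name. How you read those scores off a run is not covered on this page, so the dicts, the function names `score_deltas` and `added_checks`, and the numbers are all hypothetical.

```python
from typing import Dict, List


def score_deltas(source: Dict[str, float], rescored: Dict[str, float]) -> Dict[str, float]:
    """Return {check name: rescored - source} for checks present in both runs."""
    shared = source.keys() & rescored.keys()
    return {name: rescored[name] - source[name] for name in sorted(shared)}


def added_checks(source: Dict[str, float], rescored: Dict[str, float]) -> List[str]:
    """Return checks that exist only in the re-scored run, sorted by name."""
    return sorted(rescored.keys() - source.keys())


# Hypothetical, hand-built score dicts (not real Okareo output)
source_scores = {"response_consistency": 0.5, "total_turn_count": 8.0}
rescored_scores = {"response_consistency": 0.75, "empathy_score": 0.5}

print(score_deltas(source_scores, rescored_scores))
print(added_checks(source_scores, rescored_scores))
```

Checks that appear in only one run have no baseline to compare against, which is why they are reported separately rather than given a delta.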

Using It in Your Workflow

A common pattern: keep a "canonical" voice run that exercises representative scenarios, then re-score it whenever you change a check.

Re-score branching workflow: a single Canonical Run branches into multiple re-scored sibling runs (v2 from a new check, v3 from threshold tuning, v4 from adding empathy_score), all sharing the same conversations

Each re-score is cheap, fast, and produces a comparable run alongside the original.
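The branching pattern can be sketched as a small loop that re-scores one canonical run once per check set. This is a minimal sketch, not part of the SDK: `rescore_variants`, the variant labels, and `name_prefix` are illustrative names. The only call it relies on is `re_evaluate(test_run_id=..., checks=..., name=...)`, in the shape shown in the SDK section above.

```python
from typing import Any, Dict, Iterable, Mapping


def rescore_variants(
    client: Any,
    source_run_id: str,
    variants: Mapping[str, Iterable[str]],
    name_prefix: str = "Re-score",
) -> Dict[str, Any]:
    """Re-score one source run once per named check set.

    `client` is any object exposing
    re_evaluate(test_run_id=..., checks=..., name=...).
    Returns a dict mapping each variant label to the run that call returned.
    """
    results: Dict[str, Any] = {}
    for label, checks in variants.items():
        check_list = list(checks)
        if not check_list:
            # An empty check list would produce a run with nothing to compare
            raise ValueError(f"Variant {label!r} has no checks")
        results[label] = client.re_evaluate(
            test_run_id=source_run_id,
            checks=check_list,
            name=f"{name_prefix} - {label}",
        )
    return results
```

With an `Okareo` client and a source run ID in hand, a call might look like `rescore_variants(okareo, source_run_id, {"v2": ["response_consistency"], "v4": ["response_consistency", "empathy_score"]})`. Putting the variant label in the run name keeps the siblings easy to tell apart in the run list.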

Where to Go Next

Voice Checks: the catalog of checks available to re-score with.
Experimentation and A/B Testing: once you have multiple runs (originals or re-scores), compare them in the app.

Cookbook

Full runnable script: 05_rescore.py
