
Re-Scoring Past Runs

Phone calls cost money. Re-scoring lets you try different checks against an existing run without placing a single new call. The transcripts and recordings are already on Okareo's side. Only the evaluation step runs again.

When to Use This

  • Tuning thresholds. You want to see what the run's result_completed rate would be with a stricter judge prompt.
  • Comparing check definitions. You shipped a new version of a custom check; rescore last week's runs to validate it.
  • Retroactive analysis. A new check (e.g. empathy_score) didn't exist when the run happened. Add it now without re-calling.
  • Cost discipline. Iterate on quality criteria during development without burning real call minutes.

How to Re-Score

In the App

  1. Open any finished voice simulation run.
  2. Click the Re-score button.
  3. In the Re-score run modal, select the checks you want to apply. The selector groups Okareo's published checks and your custom checks; mix them freely.
  4. Adjust the Re-score name if you want, then click Re-score.
  5. The new run appears in your simulations list, tagged with a Source link back to the original. Open both to compare check results on the same conversations.

[Figure: Run header actions on a finished run, including the Re-score button]
[Figure: Re-score run modal with check selector and run name field]

From the SDK

okareo.re_evaluate(...) accepts the ID of the source run, a list of checks, and an optional new name. It returns a new test run with the same conversations and the new scores.

import os
from okareo import Okareo

okareo = Okareo(os.environ["OKAREO_API_KEY"])

# 1. Find the source run you want to re-evaluate
all_runs = okareo.find_test_runs(name="Multi-Scenario Voice Sim")
source = next(r for r in all_runs if r["status"] == "FINISHED")

# 2. Re-evaluate with new checks
new_run = okareo.re_evaluate(
    test_run_id=source["id"],
    checks=["response_consistency", "total_turn_count"],
    name="Rescore - New Checks",
)
print(f"New run: {new_run.id}")
print(f"Results: {new_run.app_link}")

| Argument | Notes |
| --- | --- |
| test_run_id | ID of the source run to re-evaluate. |
| checks | List of check names or check UUIDs. Mix of predefined and custom checks works. |
| name | Optional. Name for the new re-evaluated run; helps distinguish it in the run list. |
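
The next(...) call in the snippet above raises StopIteration when no run has finished yet. A minimal sketch of a more defensive selection follows. It assumes, as the snippet does, that each run is a dict with a "status" key; pick_finished_run is a hypothetical helper, not part of the Okareo SDK.

```python
from typing import Any, Dict, Iterable, Optional


def pick_finished_run(runs: Iterable[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Return the first run whose status is FINISHED, or None if there is none.

    Assumption: runs are dict-like with a "status" key, as in the snippet above.
    """
    return next((r for r in runs if r.get("status") == "FINISHED"), None)
```

If the helper returns None, there is nothing to re-score yet, so the re_evaluate call can be skipped.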

What Gets Reused vs Recomputed

| Reused (from the source run) | Recomputed (in the new run) |
| --- | --- |
| Phone calls (no new dialing) | Per-turn check scores |
| Transcripts | Per-row aggregate scores |
| Audio recordings | Run-level mean / percentile scores |
| Driver and Target configurations | Per-check pass/fail status |
| Scenario rows and metadata | App link / run ID |

The new run is a sibling of the source: same conversations, fresh evaluation pass.
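
Because both runs cover the same conversations, their run-level scores can be compared check by check. The sketch below assumes you have already collected each run's scores into a plain dict of check name to numeric score. That shape, and the diff_check_scores helper itself, are illustrative assumptions; how the scores are fetched from Okareo is not covered here.

```python
from typing import Dict, Optional, Tuple

ScoreDiff = Tuple[Optional[float], Optional[float], Optional[float]]


def diff_check_scores(
    source_scores: Dict[str, float],
    rescored_scores: Dict[str, float],
) -> Dict[str, ScoreDiff]:
    """Map each check name to (source score, re-scored score, delta).

    A check that exists in only one run gets None for the missing score
    and None for the delta, since there is nothing to subtract.
    """
    result: Dict[str, ScoreDiff] = {}
    for check in sorted(set(source_scores) | set(rescored_scores)):
        before = source_scores.get(check)
        after = rescored_scores.get(check)
        delta = None if before is None or after is None else after - before
        result[check] = (before, after, delta)
    return result
```

A check added only in the re-score (for example empathy_score) shows up with no source score and no delta, which makes retroactive additions easy to spot.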

Using It in Your Workflow

A common pattern: keep a "canonical" voice run that exercises representative scenarios, then re-score it whenever you change a check.

[Figure: Re-score branching workflow: a single Canonical Run branches into multiple re-scored sibling runs (v2 from a new check, v3 from threshold tuning, v4 from adding empathy_score), all sharing the same conversations]

Each re-score places no new calls, so it is cheap and fast, and it produces a run you can compare directly with the original.
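
The branching pattern can be sketched as a small loop that re-scores one canonical run once per set of checks. This is a hedged sketch: rescore_variants is a hypothetical helper, and it takes the re-evaluate function as a parameter, so it assumes only the keyword arguments documented above (test_run_id, checks, name).

```python
from typing import Any, Callable, Dict, Mapping, Sequence


def rescore_variants(
    re_evaluate: Callable[..., Any],
    source_run_id: str,
    variants: Mapping[str, Sequence[str]],
) -> Dict[str, Any]:
    """Create one re-scored sibling run per named set of checks.

    re_evaluate is expected to accept test_run_id, checks, and name as
    keyword arguments, as okareo.re_evaluate does in the SDK example.
    """
    runs: Dict[str, Any] = {}
    for run_name, checks in variants.items():
        runs[run_name] = re_evaluate(
            test_run_id=source_run_id,
            checks=list(checks),
            name=run_name,
        )
    return runs
```

With the SDK client from the earlier example, a call might look like rescore_variants(okareo.re_evaluate, source["id"], {"Canonical v2 - new check": ["response_consistency"], "Canonical v4 - empathy": ["empathy_score"]}). The variant names here are placeholders. Passing the function in rather than the client keeps the helper easy to exercise with a stub before spending any API calls.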

Where to Go Next

Cookbook

Full runnable script: 05_rescore.py