CI Gating

Automate quality gates in your CI pipeline. Run a voice simulation suite on every push, pull request, or nightly schedule, and fail the build when key checks drop below threshold.

How It Works

Define a scenario with representative test cases.
Run a simulation against your voice agent.
Compare the resulting check scores against your thresholds.
Exit non-zero if any threshold is missed, blocking the deploy.

The Okareo run link is printed on every execution, so a failed CI build always points reviewers at the actual conversations.

Example

import os
import sys
from okareo import Okareo
from okareo.model_under_test import PhoneTarget, Target
from okareo_api_client.models import ScenarioSetCreate

okareo = Okareo(os.environ["OKAREO_API_KEY"])

target = Target(name="My Voice Agent", target=PhoneTarget(phone_number="+1XXXXXXXXXX"))
driver = okareo.generate_driver_prompt("Customer calling support with a routine account question")

scenario = okareo.create_scenario_set(ScenarioSetCreate(
    name="CI Gate",
    seed_data=okareo.seed_data_from_list([
        {"input": "Reset your account password. Confirm you received the reset email.",
         "result": "Agent walks through password reset process"},
        {"input": "Find out weekend business hours. Confirm Saturday vs Sunday.",
         "result": "Agent provides weekend hours"},
    ]),
))

result = okareo.run_simulation(
    name="CI Gate - Voice Quality",
    target=target,
    scenario=scenario,
    driver=driver,
    max_turns=4,
    repeats=2,
    checks=["avg_turn_taking_latency", "result_completed", "response_loop"],
)

scores = result.model_metrics.to_dict().get("mean_scores", {})

THRESHOLDS = {
    "result_completed": 0.75,
    "response_loop": 0.75,
}

print(f"Results: {result.app_link}")

failed = False
for check, threshold in THRESHOLDS.items():
    score = scores.get(check)
    if score is None:
        print(f"  {check}: MISSING - FAILED")
        failed = True
    elif score >= threshold:
        print(f"  {check}: {score:.2f} >= {threshold} - PASSED")
    else:
        print(f"  {check}: {score:.2f} < {threshold} - FAILED")
        failed = True

sys.exit(1 if failed else 0)

Where to Go Next

Experimentation and A/B Testing: when you need statistical comparison between two configurations, not just a threshold check.
Scheduling Simulations: wire this script into GitHub Actions, CircleCI, or a nightly cron.
Voice Checks: choose which checks to gate on.

Cookbook

Full runnable script: 07_ci_gate.py

How It Works​

Example​

Where to Go Next​

How It Works

Example

Where to Go Next