CI Gating
Automate quality gates in your CI pipeline. Run a voice simulation suite on every push, pull request, or nightly schedule, and fail the build when key checks drop below threshold.
How It Works
- Define a scenario with representative test cases.
- Run a simulation against your voice agent.
- Compare the resulting check scores against your thresholds.
- Exit non-zero if any threshold is missed, blocking the deploy.
The Okareo run link is printed on every execution, so a failed CI build always points reviewers at the actual conversations.
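The compare-and-exit steps above can be sketched as a small pure function, independent of any SDK. This is a minimal illustration, not Okareo's implementation: `evaluate_gate` is a hypothetical helper name, and the scores are made-up values standing in for the mean check scores a real run would return.

```python
from typing import Dict, List, Optional


def evaluate_gate(scores: Dict[str, Optional[float]],
                  thresholds: Dict[str, float]) -> List[str]:
    """Return one failure message per check that is missing or below threshold."""
    failures = []
    for check, threshold in thresholds.items():
        score = scores.get(check)
        if score is None:
            # A check that produced no score counts as a failure,
            # so a misconfigured run cannot pass silently.
            failures.append(f"{check}: missing")
        elif score < threshold:
            failures.append(f"{check}: {score:.2f} < {threshold}")
    return failures


# Hypothetical scores, made up for illustration only.
example_scores = {"result_completed": 0.80, "response_loop": 0.50}
example_thresholds = {"result_completed": 0.75, "response_loop": 0.75}

failures = evaluate_gate(example_scores, example_thresholds)
exit_code = 1 if failures else 0  # a CI script would pass this to sys.exit()
```

Keeping the comparison separate from the simulation call makes the gate logic easy to unit test without placing a phone call.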
Example
import os
import sys

from okareo import Okareo
from okareo.model_under_test import PhoneTarget, Target
from okareo_api_client.models import ScenarioSetCreate

# Authenticate with the API key supplied by the CI environment.
okareo = Okareo(os.environ["OKAREO_API_KEY"])

# The voice agent under test, reached by phone number.
target = Target(name="My Voice Agent", target=PhoneTarget(phone_number="+1XXXXXXXXXX"))

# Driver prompt describing the simulated caller.
driver = okareo.generate_driver_prompt("Customer calling support with a routine account question")

# Scenario: each row pairs a caller objective with the expected outcome.
scenario = okareo.create_scenario_set(ScenarioSetCreate(
    name="CI Gate",
    seed_data=okareo.seed_data_from_list([
        {"input": "Reset your account password. Confirm you received the reset email.",
         "result": "Agent walks through password reset process"},
        {"input": "Find out weekend business hours. Confirm Saturday vs Sunday.",
         "result": "Agent provides weekend hours"},
    ]),
))

# Run the simulation and evaluate the listed checks.
result = okareo.run_simulation(
    name="CI Gate - Voice Quality",
    target=target,
    scenario=scenario,
    driver=driver,
    max_turns=4,
    repeats=2,
    checks=["avg_turn_taking_latency", "result_completed", "response_loop"],
)

# Mean score per check for this run.
scores = result.model_metrics.to_dict().get("mean_scores", {})

# Minimum acceptable mean score for each gated check.
# avg_turn_taking_latency is collected above but not gated here.
THRESHOLDS = {
    "result_completed": 0.75,
    "response_loop": 0.75,
}

# Print the run link so reviewers can open the conversations.
print(f"Results: {result.app_link}")

failed = False
for check, threshold in THRESHOLDS.items():
    score = scores.get(check)
    if score is None:
        # A gated check with no score fails the build.
        print(f" {check}: MISSING - FAILED")
        failed = True
    elif score >= threshold:
        print(f" {check}: {score:.2f} >= {threshold} - PASSED")
    else:
        print(f" {check}: {score:.2f} < {threshold} - FAILED")
        failed = True

# A non-zero exit code fails the CI job.
sys.exit(1 if failed else 0)
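The example collects `avg_turn_taking_latency` but gates only on the two checks where a higher score is better. If you also want to gate on latency, one possible approach is to attach a direction to each threshold. The sketch below is an assumption-laden illustration: it assumes the latency value appears alongside the other mean scores and that lower is better, and the limit of `1500.0` is a placeholder whose unit is not taken from Okareo. `check_passes` is a hypothetical helper.

```python
from typing import Dict, Optional, Tuple

# Each gate is (direction, limit):
#   "min" -> the score must be at least the limit
#   "max" -> the score must be at most the limit
Gate = Tuple[str, float]


def check_passes(score: Optional[float], gate: Gate) -> bool:
    """Return True when the score satisfies the gate; missing scores fail."""
    direction, limit = gate
    if score is None:
        return False
    if direction == "min":
        return score >= limit
    if direction == "max":
        return score <= limit
    raise ValueError(f"unknown gate direction: {direction!r}")


# Placeholder limits and made-up scores, for illustration only.
gates: Dict[str, Gate] = {
    "result_completed": ("min", 0.75),
    "avg_turn_taking_latency": ("max", 1500.0),
}
sample_scores = {"result_completed": 0.90, "avg_turn_taking_latency": 2100.0}

failed_checks = [
    name for name, gate in gates.items()
    if not check_passes(sample_scores.get(name), gate)
]
```

With these sample values, only the latency gate fails, so a CI script using this list would exit non-zero.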
Where to Go Next
- Experimentation and A/B Testing: when you need statistical comparison between two configurations, not just a threshold check.
- Scheduling Simulations: wire this script into GitHub Actions, CircleCI, or a nightly cron.
- Voice Checks: choose which checks to gate on.
Cookbook
Full runnable script: 07_ci_gate.py