Okareo MCP
Simulation and Evals in Your Editor
The Okareo MCP server turns your AI coding assistant into an evaluation and testing workbench. Instead of writing SDK scripts or switching to the dashboard, you describe what you want in natural language and your copilot handles the rest — creating scenarios from your codebase, running simulations against your agents, and pulling back results for analysis.
This works with any MCP-ready editor — including Claude Code, Cursor, Cline, GitHub Copilot, and Windsurf. See the Configuration page for setup instructions.
What Makes This Different
With the SDK or CLI, you write explicit code to define scenarios, register models, and run tests. With the MCP, your copilot reads your project, understands your domain, and orchestrates Okareo tools on your behalf. This means you can:
- Generate test content from your code — point your copilot at a module and ask it to produce scenarios that cover the key behaviors.
- Iterate conversationally — run a simulation, review the transcript, adjust the driver persona, and re-run without writing any code.
- Compare and analyze across runs — ask your copilot to pull results from multiple test runs and summarize what changed.
Creating Scenarios
Scenarios are the foundation of every evaluation in Okareo. With the MCP, your copilot can build them directly from your project context.
From your codebase:
"Read through the customer support handlers in src/handlers/ and create a scenario with 10 realistic support questions and expected answers."
From existing data:
"Take the edge cases we documented in tests/fixtures/edge_cases.json and save them as an Okareo scenario called 'Edge Case Coverage'."
Versioning:
"Create a new version of the 'Product Q&A' scenario that adds 5 questions about the new pricing tier."
Your copilot uses save_scenario to create the dataset, list_scenarios to find existing ones, and create_scenario_version to track iterations.
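Under the hood, each of these requests becomes an MCP tool call. A minimal sketch of the payload your copilot might send for save_scenario (the tool name comes from the Okareo MCP; the argument keys and seed data below are illustrative assumptions, not the exact schema):

```python
# Illustrative save_scenario tool-call payload. Argument keys are
# assumptions for illustration; see the MCP Reference for the schema.
save_scenario_call = {
    "name": "save_scenario",
    "arguments": {
        "scenario_name": "Edge Case Coverage",
        "seed_data": [
            # Each seed pairs an input with the expected behavior.
            {"input": "My invoice shows a duplicate charge.",
             "result": "Apologize, verify the charge, and offer a refund."},
        ],
    },
}

print(save_scenario_call["name"])  # save_scenario
```

The point is that you never write this payload yourself: the copilot assembles it from your prompt and your project files.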
Setting Up Drivers
Drivers define the simulated user personas that interact with your system during multi-turn simulations. The persona, tone, and objectives of the driver shape the conversation.
Define a persona:
"Create a driver called 'Frustrated Customer' that simulates a user who is upset about a billing error, asks pointed questions, and escalates if not satisfied within 3 turns."
Tailor to your domain:
"Based on the user types described in our product spec, create drivers for a first-time user, a power user, and an adversarial tester."
Your copilot uses create_or_update_driver to define personas, get_driver to inspect a driver's configuration, and list_drivers to manage existing ones.
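A driver definition is essentially a persona prompt plus conversation constraints. A rough sketch of what the 'Frustrated Customer' example might become as a create_or_update_driver call (argument keys are assumptions, not the exact schema):

```python
# Illustrative create_or_update_driver payload; the tool name is from
# the Okareo MCP, the argument keys are assumptions for illustration.
frustrated_customer = {
    "name": "create_or_update_driver",
    "arguments": {
        "driver_name": "Frustrated Customer",
        "prompt": (
            "You are upset about a billing error. Ask pointed, specific "
            "questions, and escalate to a supervisor request if you are "
            "not satisfied within 3 turns."
        ),
    },
}

print(frustrated_customer["arguments"]["driver_name"])  # Frustrated Customer
```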
Running Simulations
Multi-turn simulations test how your system handles real conversations. The MCP orchestrates the full flow:
- Define a target — the system under test (a hosted model or your own endpoint)
- Choose a driver — the simulated user persona
- Select a scenario — the test inputs and expected behaviors
- Run the simulation — Okareo alternates turns between driver and target
- Review results — transcripts, per-turn check scores, and aggregate metrics
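The steps above can be sketched as a tool-call sequence. Here `call_tool` is a stand-in for the editor's MCP client; the tool names match the Okareo MCP, while the argument keys and the returned shapes are illustrative assumptions:

```python
# Sketch of the simulation flow the copilot orchestrates on your behalf.
def run_flow(call_tool):
    # 1. Define the target: the system under test.
    target = call_tool("create_or_update_target",
                       {"name": "Support Agent",
                        "url": "https://api.example.com/chat"})
    # 2-4. Run the simulation with a chosen driver and scenario;
    # Okareo alternates turns between driver and target.
    call_tool("run_simulation",
              {"target_id": target["id"],
               "driver_name": "Frustrated Customer",
               "scenario_name": "Billing Issues"})
    # 5. Pull back transcripts and scores for review.
    return call_tool("get_test_run_results", {"latest": True})

# Stub client that just records the call order, for illustration:
calls = []
def stub(name, args):
    calls.append(name)
    return {"id": "t-1"}

run_flow(stub)
print(calls)  # ['create_or_update_target', 'run_simulation', 'get_test_run_results']
```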
Example:
"Set up a target pointing to our customer service agent at https://api.example.com/chat, use the 'Frustrated Customer' driver, run a simulation with the 'Billing Issues' scenario, and show me the results."
Your copilot handles this end-to-end, using create_or_update_target, get_target, and list_targets to manage targets, run_simulation to execute conversations, and get_test_run_results to retrieve the outcome.
Getting Results and Comparing Runs
After running evaluations or simulations, you can ask your copilot to retrieve and analyze the results.
Review a single run:
"Show me the detailed results from the last test run, including per-row scores and any failures."
Compare across runs:
"Compare the results from yesterday's 'Product Q&A' evaluation against today's run. What scores improved? What regressed?"
Identify patterns:
"Look at the last 5 simulation runs for the 'Frustrated Customer' driver. Are there turns where the agent consistently fails to de-escalate?"
Your copilot uses list_test_runs to find runs, get_test_run_results for detailed scores, and list_simulations to browse simulation history.
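When you ask for a comparison, the copilot is effectively diffing per-check scores between two runs. A minimal sketch of that computation, assuming (hypothetically) that each run's results reduce to a mapping from check name to average score:

```python
# Diff per-check average scores between two runs. The {check: score}
# result shape is an assumption about what get_test_run_results yields.
def diff_runs(previous, current):
    """Return {check: delta} for checks present in both runs."""
    return {check: round(current[check] - previous[check], 3)
            for check in previous if check in current}

deltas = diff_runs({"relevance": 0.82, "tone": 0.91},   # yesterday
                   {"relevance": 0.88, "tone": 0.87})   # today
print(deltas)  # {'relevance': 0.06, 'tone': -0.04}
```

Positive deltas are improvements, negative ones are regressions; your copilot then narrates which checks moved and by how much.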
Custom Checks
You can define evaluation criteria directly from your editor:
- From a description: "Create a check called 'stays_on_topic' that fails if the model response discusses anything outside of our product domain."
- From a template: "Get the boolean check template and create a check that verifies the response includes a specific disclaimer."
- Code-based: "Create a Python check that compares response length to the expected result and passes if within 20%."
Your copilot uses generate_check for AI-generated checks, get_templates for prompt templates, and create_or_update_check to save them.
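As one concrete example, the code-based check described above (response length within 20% of the expected result) might look something like this. The `evaluate` function signature is an assumption; ask your copilot for the code check template via get_templates to see the exact interface Okareo expects:

```python
# Hypothetical code-based check: pass if the response length is
# within 20% of the expected result's length.
def evaluate(model_output: str, expected_result: str) -> bool:
    expected_len = len(expected_result)
    if expected_len == 0:
        # Avoid division by zero: only an empty response matches.
        return len(model_output) == 0
    return abs(len(model_output) - expected_len) / expected_len <= 0.20

print(evaluate("a" * 110, "b" * 100))  # True: 10% longer, within tolerance
print(evaluate("a" * 130, "b" * 100))  # False: 30% longer, out of tolerance
```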
Next Steps
- Configuration — set up the MCP server in your editor
- MCP Reference — full list of available tools and environment variables
- Simulations — learn more about multi-turn simulation concepts
- Scenarios — learn more about scenario design and formatting