Okareo MCP

Simulation and Evals in Your Editor

The Okareo MCP server turns your AI coding assistant into an evaluation and testing workbench. Instead of writing SDK scripts or switching to the dashboard, you describe what you want in natural language and your copilot handles the rest — creating scenarios from your codebase, running simulations against your agents, ingesting voice conversations for monitoring, and pulling back results for analysis.

Okareo MCP is hosted at https://tools.okareo.com/mcp. Point your copilot at the URL, sign in through your browser on first connect, and the tools appear in your editor: no install, no Python, no container. It works with any MCP-ready editor, including Claude Code, Claude Desktop, Cursor, VS Code (1.101+), Cline, Gemini Code Assist, GitHub Copilot, and Windsurf. See the Configuration page for setup instructions.

What Makes This Different

With the SDK or CLI, you write explicit code to define scenarios, register models, and run tests. With the MCP, your copilot reads your project, understands your domain, and orchestrates Okareo tools on your behalf. This means you can:

  • Generate test content from your code — point your copilot at a module and ask it to produce scenarios that cover the key behaviors.
  • Iterate conversationally — run a simulation, review the transcript, adjust the driver persona, and re-run without writing any code.
  • Compare and analyze across runs — ask your copilot to pull results from multiple test runs and summarize what changed.
  • Monitor production voice and chat traffic — pipe completed conversations into Okareo for scoring and trend analysis.

Creating Scenarios

Scenarios are the foundation of every evaluation in Okareo. With the MCP, your copilot can build them directly from your project context.

From your codebase:

"Read through the customer support handlers in src/handlers/ and create a scenario with 10 realistic support questions and expected answers."

From existing data:

"Take the edge cases we documented in tests/fixtures/edge_cases.json and save them as an Okareo scenario called 'Edge Case Coverage'."

Versioning:

"Create a new version of the 'Product Q&A' scenario that adds 5 questions about the new pricing tier."

Your copilot uses save_scenario to create the dataset, list_scenarios to find existing ones, and create_scenario_version to track iterations.
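
To make the prompts above concrete, it can help to picture a scenario as a named list of input and expected-result pairs. The sketch below is illustrative only: the helper build_scenario, the field names "input", "result", and "rows", and the sample question are all assumptions, not the actual schema of the save_scenario tool. Your copilot assembles the real payload for you.

```python
import json

def build_scenario(name, pairs):
    """Return a hypothetical scenario payload from (question, expected answer) pairs."""
    # Skip pairs whose question is blank so the dataset holds only usable rows.
    rows = [
        {"input": question.strip(), "result": answer.strip()}
        for question, answer in pairs
        if question.strip()
    ]
    return {"name": name, "rows": rows}

# Made-up sample data, standing in for what a copilot might extract from fixtures.
scenario = build_scenario(
    "Edge Case Coverage",
    [
        ("What happens if I cancel mid-cycle?", "You keep access until the period ends."),
        ("   ", "Dropped because the question is blank."),
    ],
)
print(json.dumps(scenario, indent=2))
```

Keeping each row as a plain input/result pair makes it easy to review what the copilot generated before it is saved.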

Setting Up Drivers

Drivers define the simulated user personas that interact with your system during multi-turn simulations. The persona, tone, and objectives of the driver shape the conversation. Drivers can also be configured with a specific voice and language for voice-mode simulations.

Define a persona:

"Create a driver called 'Frustrated Customer' that simulates a user who is upset about a billing error, asks pointed questions, and escalates if not satisfied within 3 turns."

Tailor to your domain:

"Based on the user types described in our product spec, create drivers for a first-time user, a power user, and an adversarial tester."

Voice-configured drivers:

"List the available driver voices, then create a Spanish-speaking 'Impatient Caller' driver using a natural-sounding female voice."

Your copilot uses create_or_update_driver to define personas, list_driver_voices to discover available voices and languages, get_driver to inspect a driver's configuration, and list_drivers to manage existing ones.

Running Simulations

Multi-turn simulations test how your system handles real conversations. The MCP orchestrates the full flow:

  1. Define a target — the system under test (a hosted model or your own endpoint)
  2. Choose a driver — the simulated user persona
  3. Select a scenario — the test inputs and expected behaviors
  4. Run the simulation — Okareo alternates turns between driver and target
  5. Review results — transcripts, per-turn check scores, and aggregate metrics

Example:

"Set up a target pointing to our customer service agent at https://api.example.com/chat, use the 'Frustrated Customer' driver, run a simulation with the 'Billing Issues' scenario, and show me the results."

Your copilot uses create_or_update_target, get_target, and list_targets to manage targets, run_simulation to execute conversations, list_simulations to browse history, and get_conversation_transcript to inspect individual exchanges.
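
Step 4 of the flow above, where turns alternate between driver and target, can be pictured with a toy loop. This is a conceptual sketch, not Okareo's implementation: run_turns, toy_driver, and toy_target are hypothetical stand-ins, and a real driver is a configured persona rather than a hard-coded function.

```python
def run_turns(driver, target, opening, max_turns=3):
    """Alternate driver and target messages; return the transcript as (role, text) pairs."""
    transcript = []
    message = opening
    for _ in range(max_turns):
        transcript.append(("driver", message))
        reply = target(message)
        transcript.append(("target", reply))
        message = driver(reply)
        if message is None:  # the driver has nothing more to say
            break
    return transcript

def toy_target(message):
    # Stand-in for the system under test: echoes the request back.
    return "I can help with that: " + message

def toy_driver(reply):
    # Stand-in persona: asks for a refund once, then ends the conversation.
    return None if "refund" in reply.lower() else "I want a refund."

transcript = run_turns(toy_driver, toy_target, "My bill is wrong.")
for role, text in transcript:
    print(f"{role}: {text}")
```

The max_turns cap mirrors the idea that a simulation is bounded even if the driver never decides to stop.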

Getting Results and Comparing Runs

After running evaluations or simulations, you can ask your copilot to retrieve and analyze the results.

Review a single run:

"Show me the detailed results from the last test run, including per-row scores and any failures."

Compare across runs:

"Compare the results from yesterday's 'Product Q&A' evaluation against today's run. What scores improved? What regressed?"

Identify patterns:

"Look at the last 5 simulation runs for the 'Frustrated Customer' driver. Are there turns where the agent consistently fails to de-escalate?"

Re-score without re-running:

"Re-evaluate that test run against the new pii_leak and hallucination checks — don't re-run the model."

Your copilot uses list_test_runs to find runs, get_test_run_results for detailed scores, get_conversation_transcript for individual exchanges, reevaluate_test_run to re-score an existing run against a different set of checks, and list_simulations to browse simulation history.
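
A "what improved, what regressed" comparison boils down to diffing per-check scores between two runs. The sketch below shows that logic under stated assumptions: compare_runs is a hypothetical helper, the scores are invented, and the flat check-name-to-score mapping is not the actual shape returned by get_test_run_results.

```python
def compare_runs(before, after, tolerance=0.0):
    """Classify each check present in both runs as improved, regressed, or unchanged."""
    report = {"improved": [], "regressed": [], "unchanged": []}
    # Only checks that appear in both runs can be compared.
    for check in sorted(before.keys() & after.keys()):
        delta = after[check] - before[check]
        if delta > tolerance:
            report["improved"].append(check)
        elif delta < -tolerance:
            report["regressed"].append(check)
        else:
            report["unchanged"].append(check)
    return report

# Invented aggregate scores for two runs of the same evaluation.
yesterday = {"hallucination": 0.80, "pii_leak": 1.00, "tone_polite": 0.90}
today = {"hallucination": 0.90, "pii_leak": 1.00, "tone_polite": 0.70}

report = compare_runs(yesterday, today)
print(report)
```

A nonzero tolerance is useful when small score fluctuations between runs should not count as a change.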

Custom Checks

You can define evaluation criteria directly from your editor:

  • From a description: "Create a check called 'stays_on_topic' that fails if the model response discusses anything outside of our product domain."
  • From a template: "Get the boolean check template and create a check that verifies the response includes a specific disclaimer."
  • Code-based: "Create a Python check that compares response length to the expected result and passes if within 20%."
  • Versioned: "Show me every version of the tone_polite check and roll back to v3."

Your copilot uses generate_check for AI-generated checks, get_templates for prompt templates, create_or_update_check to save them (with optional tags), and get_check / list_checks (with all_versions=true) to browse history.
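
As an illustration of the code-based example above, the core logic of a "length within 20%" check might look like the following. This is a minimal sketch of the comparison only: the function name and signature are assumptions, and it is not wrapped in whatever structure Okareo expects for a registered code-based check.

```python
def length_within_tolerance(response, expected, tolerance=0.20):
    """Pass when the response length is within `tolerance` of the expected length."""
    expected_len = len(expected)
    if expected_len == 0:
        # With an empty expected result, only an empty response can match.
        return len(response) == 0
    return abs(len(response) - expected_len) / expected_len <= tolerance

expected = "x" * 100
print(length_within_tolerance("x" * 115, expected))  # 15% longer
print(length_within_tolerance("x" * 130, expected))  # 30% longer
```

The empty-expected guard avoids a division by zero, which is the kind of edge case worth asking your copilot to cover when it generates a check.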

Voice & Conversation Monitoring

If your product handles voice calls or chat conversations, you can pipe completed sessions into Okareo for ongoing monitoring — without writing webhook glue.

Connect a provider:

"Connect our Retell account so completed calls flow into Okareo automatically."

Ingest conversations manually:

"Take this list of recent support call transcripts and ingest them into Okareo for monitoring."

Your copilot uses ingest_conversations to submit completed calls (transcripts or audio references); connect_voice_integration, list_voice_integrations, get_voice_integration, update_voice_integration, rotate_voice_integration_secret, and delete_voice_integration to manage provider connections (Retell, Twilio, Vapi, ElevenLabs); and get_voice_webhook_url to obtain the inbound endpoint to paste into the provider's console.

Analytics & Dashboards

Once you have a corpus of evaluations and ingested conversations, you can query the underlying analytics from your editor and pin recurring views as dashboards.

Ask a question:

"What's the trend in hallucination_check pass rate over the last 30 days, broken down by model?"

Save a view:

"Save this as a dashboard called 'Weekly Quality Review' and pin it to the top of the list."

Your copilot uses query_analytics to query the underlying cubes (with include_metadata=true to discover available measures and dimensions), and list_dashboards, get_dashboard, save_dashboard, reorder_dashboards, and delete_dashboard to curate dashboard views.

Next Steps

  • Configuration — connect your editor to the hosted MCP endpoint
  • MCP Reference — full list of available tools
  • Simulations — learn more about multi-turn simulation concepts
  • Scenarios — learn more about scenario design and formatting