Skip to main content

Persona & Behavior Simulation in Multi-Turn Dialogues

Okareo lets you simulate and evaluate full conversations - from straightforward question and answer flows to complex, agent-to-agent interactions. With Multi-Turn Simulations you can:

  • Verify behaviors like persona adherance and task completion across an entire dialog.
  • Stress-test your assistant with adversarial personals.
  • Call out to custom endpoints (such as your own service or RAG pipeline) and evaluate the real responses.
  • Track granular metrics and compare them over time.

Why Multi‑Turn Simulation?

Use Multi‑Turn when success depends on how the assistant behaves over time, not just what it says once.

Single‑Turn EvaluationMulti‑Turn Simulation
Spot‑checks isolated responses.Captures conversation dynamics: context, memory, tool calls, persona drift.
Limited resistance to prompt injections.Lets you inject adversarial or off‑happy‑path turns to probe robustness.
Limited visibility into session state or external calls.Can follow and score API calls, function‑calling, or custom‑endpoint responses throughout the dialog.

Core Concepts

Key Entities

TermWhat it is
TargetThe system under test—either a hosted model (e.g. gpt‑4o‑mini) or a mapping that tells Okareo how to call your service (Custom Endpoint / Custom Model).
DriverA configurable simulation of a user persona. It sends messages to the Target.
ScenarioA reusable collection of Scenario Rows. Each row provides runtime parameters (inserted into the Driver prompt) plus an expected result for checks to judge against.
Custom EndpointA REST mapping (URL, method, headers, body template, JSON paths) that lets Okareo call your running agent, LLM pipeline, RAG service, etc. during a simulation.
Custom ModelA class you implement by subclassing CustomModel in the Okareo SDK (Python or TypeScript). Provide an invoke() method and Okareo treats your proprietary code or on‑prem model as a first‑class, versioned model.
CheckA metric that scores the dialog (numeric or boolean). Built‑ins cover behavior adherence, model refusal, task completion, etc.; you can supply custom checks for your use case.

Execution Objects

TermWhat it is
SimulationA single run that alternates Driver → Target turns using one scenario row, one target, and one driver. It records the conversations between target and driver.
EvaluationThe scoring phase that executes all enabled Checks against the Simulation and produces metrics.

How It Works (High-Level)

At runtime, a Multi-Turn Simulation wires together four things and loops Driver ↔ Target until a stop rule triggers, while scoring along the way.

  1. Compose the building blocks

    • Target — the system under test (hosted model or a Custom Endpoint mapping).
    • Driver — a configurable user persona/behavior policy that generates user turns.
    • Scenario — a table of rows that provide runtime parameters (inserted into the Driver prompt) and an expected result for scoring.
    • Checks — evaluation functions (boolean or numeric) that score behavior, refusals, task completion, etc.
  2. Start a Simulation

    • A Simulation uses one Target, one Driver, and one Scenario row.
    • Okareo instantiates a conversation with the chosen first speaker, applies the Scenario row’s parameters to the Driver prompt, and initializes the run policy (e.g., max turns, optional early-stop via a designated check).
  3. Orchestration loop

    • Turns alternate Driver → Target → Driver → ….
    • If the Target is a Custom Endpoint, Okareo executes your HTTP mapping (URL, headers, body template, JSON paths) and records the raw payloads; hosted models are invoked via Okareo’s model runtime.
    • After each turn, Checks are evaluated and the partial results are attached to that turn.
  4. Stop & finalize

    • The run ends when a stop condition is met (max turns reached, the Driver concludes, or a designated stop check returns true).
    • Checks are computed a final time and aggregated for the Simulation.
  5. Inspect artifacts

    • You get a full transcript, per-turn annotations for each Check, final scores, and (for Custom Endpoints) request/response payload traces — all ready for side-by-side review and comparison over time.

Quick-Start via the UI

New: You can run your first simulation with zero setup. Okareo includes ready-to-use Target, Driver, and Scenario so you can try simulations immediately.

1 · Create your first Simulation (no setup)

  1. Go to Simulations and click + Create Multi-Turn Simulation.

Zero State

  1. The form opens pre-filled with:

    • Target: Example Bank Agent
    • Driver: First-Time User
    • Scenario: Example Driver Scenario
    • Checks: Result Completed

    The preview shows example rows (input parameters + expected results) that checks will use.

Create Form

  1. Click Run Simulation. You’ll return to the Simulations list and see the run progress; when it finishes, the tile shows a score ring and summary.

Index with Simulation Tile

2 · Inspect Results

Click the Simulation tile to open the results.

  • Conversation Transcript — the full back-and-forth between Driver and Target.
  • Checks — per-turn annotations and final scores (e.g., Behavior Adherence, Model Refusal, Task Completed).
  • Open an individual turn to view baseline metrics, driver prompt, input data, and expected result side-by-side.

Results Overview

Individual Turn Detail

(Optional) Create your own building blocks

While the defaults are great for exploration, you can create and reuse your own components:

You can mix your custom items with the defaults at any time when starting a Simulation.

Advanced Topics

Adversarial Simulations & Tool‑Call Testing

  • Add multiple rows to a Scenario that intentionally poke at edge cases (e.g. jailbreak attempts, bad‑actor personas).
  • Use the Custom Endpoint Target to exercise your entire agent pipeline, including RAG, calls to vector DBs, or function‑calling chains.
  • Combine with out-of-the-box Checks or custom checks you create.

SDK Helpers & Automation

Prompt-Based vs. Custom-Endpoint Flow

Prompt-BasedCustom Endpoint
Where logic livesModel prompt onlyYour HTTP service
Ideal forRapid iteration, early prototypingComplex RAG or tool-calling pipelines