Okareo Python SDK
Okareo's Python SDK helps you track errors and evaluate large language models (LLMs), agents, and RAG pipelines in both structured test scenarios and real-world usage. It provides a unified way to collect model telemetry (inputs, outputs, errors) and run evaluations to measure performance and reliability.
Use the Python SDK to integrate Okareo's cloud-based or self-hosted service into your applications. The SDK lets you register and test your AI code (including LLMs, agents, and embedding models), define scenarios for evaluation (single-turn or multi-turn conversations, classification, and retrieval), log real-time datapoints from your app, and use built-in or custom metrics (Checks) to evaluate performance.
Install and Authenticate
Installing the SDK: Okareo is available on PyPI. You can install it with pip as follows:
pip install okareo
After installing, you'll need to obtain an API token from Okareo. Sign up for a free account on the Okareo web app and generate an API Token. Set this token as an environment variable so the SDK can authenticate:
export OKAREO_API_KEY="<YOUR_TOKEN>"
Alternatively, you can pass the API key directly in code when initializing the client (see below).
Authenticating the Okareo client: In your Python code, import the Okareo class and create an instance, providing your API key. For example:
from okareo import Okareo
okareo = Okareo("YOUR API TOKEN")
This creates a client that communicates with Okareo's cloud API. You can either set the OKAREO_API_KEY environment variable or pass the key directly as a parameter.
Verify
You can verify your installation and API connectivity with this simple test:
import os
from okareo import Okareo
# Initialize client (using environment variable)
okareo = Okareo(os.environ["OKAREO_API_KEY"])
print(f"✅ Installation verified! Projects for this account: {okareo.get_projects()}")
Using the Okareo SDK
Once the Okareo client is authenticated, you can use the SDK to register models/agents, define evaluation scenarios, run tests, and log data. Below are common usage patterns with code examples.
Defining Scenarios for Evaluation
A Scenario Set in Okareo represents a dataset or set of test cases (inputs and expected outputs) against which to evaluate your application. You can create scenarios programmatically using the SDK:
Prepare seed data: Each scenario data point consists of an input (e.g. a prompt or query) and an expected result (e.g. the correct answer or ideal response). The SDK provides a SeedData model class to structure these. For convenience, you can use a Python list of dictionaries and convert it to SeedData objects using Okareo.seed_data_from_list(...). For example:
from okareo import Okareo
from okareo_api_client.models import ScenarioSetCreate, SeedData
# Define a list of test cases as dicts
seed_items = [
{"input": "Capital of France?", "result": "Paris"},
{"input": "5 + 7 =", "result": "12"}
]
# Convert to SeedData objects
seed_data = Okareo.seed_data_from_list(seed_items)
(The above returns a list of okareo_api_client.models.SeedData objects – this class is part of the SDK and defines the schema for inputs/results; the seed_data_from_list helper does the conversion for you.)
Create a scenario set: Once you have a list of SeedData, call create_scenario_set on the Okareo client. You'll pass a ScenarioSetCreate object with a name and the seed data list:
from okareo_api_client.models import ScenarioSetCreate
scenario_req = ScenarioSetCreate(
name="My Evaluation Scenario",
seed_data=seed_data
)
scenario_set = okareo.create_scenario_set(scenario_req)
print(scenario_set.app_link)
In this example, scenario_set will be a ScenarioSetResponse containing a unique scenario_id that identifies this set and an app_link – a link to the newly created scenario in the Okareo web app. You can create scenario sets for different purposes – e.g. a set of questions for a Question/Answer model task, or a set of conversation turns for a chatbot. Okareo allows using these scenarios both for driving evaluations and as seeds for synthetic scenario generation (more on generation later).
Running Evaluations – GenerationModel
GenerationModel provides a standard interface to evaluate text-generation models (LLMs) in Okareo. You define the model by specifying its identifier and parameters (like temperature), register it with Okareo, create a scenario set of test prompts and expected results, then run an evaluation:
import os
from okareo import Okareo
from okareo.model_under_test import GenerationModel
from okareo_api_client.models import ScenarioSetCreate # class for scenario set creation
from okareo_api_client.models import TestRunType # types of evaluations that could be run (Classification/Retrieval/Text Generation)
okareo = Okareo(os.environ["OKAREO_API_KEY"])
# 1. Define a generation model (e.g., LLM with a given model ID and parameters)
model = GenerationModel(model_id="gpt-4o",
temperature=0.7,
system_prompt_template="{input}") # in this template {input} pulls from scenario set
# 2. Register the model with Okareo (returns a ModelUnderTest instance)
mut = okareo.register_model(name="LLM under evaluation", model=model) # ModelUnderTest for the registered model
# 3. Create a scenario set with input-output pairs for evaluation
seed_data = Okareo.seed_data_from_list([
{"input": "What is 2+2?", "result": "4"},
{"input": "Hello", "result": "Hi"}
]) # Convert list of dicts to SeedData objects
scenario_request = ScenarioSetCreate(name="Basic Q&A", seed_data=seed_data)
scenario_set = okareo.create_scenario_set(scenario_request) # Create scenario set in Okareo
# 4. Run an evaluation on the scenario set
test_run = mut.run_test(scenario=scenario_set,
name="Example Eval Run",
api_key=os.environ["OPENAI_API_KEY"],
test_run_type=TestRunType.NL_GENERATION, # text-generation evaluation
checks=["reference_similarity"])
print(test_run.app_link)
In the above example, we define a generation model with a model_id and temperature (you can also set prompt templates if needed). We register it via register_model() – providing a name and the GenerationModel object – which returns a ModelUnderTest handle tied to that model. We then prepare a Scenario Set of two simple Q&A pairs. The helper Okareo.seed_data_from_list() converts a list of {"input": ..., "result": ...} mappings into the required SeedData format. Using okareo.create_scenario_set(...), we upload this scenario set to the Okareo platform. Finally, calling mut.run_test(...) executes the model on each scenario in the set and returns a TestRunItem result (which includes metrics and outcomes). This allows you to automatically evaluate the model's responses against expected results.
Obtaining results: run_test typically blocks until the evaluation is complete (the HTTP call returns when the run is done). You can also log in to the Okareo web app – each test run is visible there with detailed reports (the TestRunItem.app_link field gives a direct URL to view the run in the app).
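For a quick sanity check you can print a couple of fields from the returned run object – app_link is the field used throughout this guide, while other TestRunItem attributes may vary by SDK version:
# Quick look at the completed run
print(test_run.name)      # the name passed to run_test (assumed attribute)
print(test_run.app_link)  # direct URL to the run in the Okareo web app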
Okareo supports many providers through GenerationModel (OpenAI, Cohere, Anthropic, etc.). The register_model() call returns a model handle we'll use for running tests. (Model names must be unique within your project.)
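For instance, switching providers is mostly a matter of changing the model_id and the API key passed to run_test. The model ID below is an assumption, so substitute one your account can access:
# Hypothetical example: an Anthropic model via GenerationModel
claude_model = GenerationModel(
    model_id="claude-3-haiku-20240307",  # assumption: use any model ID your provider offers
    temperature=0.2,
    system_prompt_template="{input}",
)
claude_mut = okareo.register_model(name="Claude under evaluation", model=claude_model)
# When running tests, pass the provider's key, e.g. api_key=os.environ["ANTHROPIC_API_KEY"]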
Running Evaluations – CustomModel and ModelInvocation
Okareo also supports evaluating custom models – models not covered by built-in providers – by subclassing CustomModel. You implement the invoke() method to call your model and return a ModelInvocation result. This lets you integrate any arbitrary model (e.g. from Hugging Face) or custom logic into the Okareo evaluation framework:
from okareo.model_under_test import CustomModel, ModelInvocation
# Define a custom model by subclassing CustomModel
class MyModel(CustomModel):
def __init__(self):
super().__init__(name="MyModel") # set a name for the custom model
def invoke(self, input_value):
# Simple echo model: returns the input as the "prediction"
return ModelInvocation(
model_prediction=input_value,
model_input=input_value
) # wrap the output and input in a ModelInvocation object
# Register the custom model
model = MyModel()
mut = okareo.register_model(name="Custom Model", model=model)
# Run an evaluation on the custom model, reusing the scenario_set defined earlier
test_run = mut.run_test(scenario=scenario_set,
name="Custom Test",
test_run_type=TestRunType.NL_GENERATION, # text-generation evaluation
checks=["reference_similarity"])
In this example, MyModel inherits from CustomModel and implements invoke(self, input_value). Inside invoke, we call our actual model logic – here we simply echo the input – and return a ModelInvocation object containing the prediction and any relevant metadata. The Okareo SDK uses this to handle custom model outputs uniformly. We then register MyModel with register_model() (just like a built-in model) and run a test on a given scenario set. The process for creating or reusing scenario sets and executing run_test is identical, making custom models first-class citizens in the evaluation workflow. This way, you can evaluate proprietary models or algorithms with the same scenario-based approach and metrics as other models.
Classification
The Okareo Python SDK makes it easy to set up and evaluate classification tasks—where the goal is to assign input text to one of several predefined categories. A common use case is intent detection in Retrieval-Augmented Generation (RAG) systems, where identifying the user’s intent helps retrieve more relevant context and improves response quality.
To define a classification task, you provide instructions that describe the input format, output format, and list of valid categories—like "Support," "Returns," "Membership," or "Sustainability." You can run these tasks with pre-trained models or fine-tune your own using the SDK to get better results on your specific data. This lets you create highly specialized classification models tailored to your application's needs.
The SDK also includes built-in tools to evaluate classification accuracy and performance directly in the Okareo platform.
Here’s a minimal example to get started:
import os
from okareo import Okareo
from okareo.model_under_test import CustomModel, ModelInvocation
from okareo_api_client.models import ScenarioSetCreate, TestRunType
okareo = Okareo(os.environ["OKAREO_API_KEY"])
# 1. Define a classification model that returns a label for each input
# (we subclass CustomModel as a simple example, but this could be a GenerationModel, a Hugging Face model, etc.)
class MyClassifier(CustomModel):
def invoke(self, input_text: str):
# Simple rule-based classifier for illustration
if "return" in input_text.lower():
result_label = "returns"
elif "how much" in input_text.lower():
result_label = "pricing"
else:
result_label = "complaints"
return ModelInvocation(
model_prediction=result_label,
model_input=input_text,
model_output_metadata={"extracted_entities": {"return": True, "amount": 23}} # example use is query extraction,etc
)
# 2. Register the custom classification model with Okareo
mut = okareo.register_model(name="Intent Classifier", model=MyClassifier(name="ClassifierModel"))
# 3. Create a scenario set with input texts and expected labels
seed_data = Okareo.seed_data_from_list([
{"input": "I want to send this product back", "result": "returns"},
{"input": "How much does this product cost?", "result": "pricing"}
])
scenario_set = okareo.create_scenario_set(ScenarioSetCreate(name="Intent Test Set", seed_data=seed_data))
# 4. Run the classification evaluation on the scenario set
test_run = mut.run_test(scenario=scenario_set,
name="Classification Test Run",
test_run_type=TestRunType.MULTI_CLASS_CLASSIFICATION)
print(test_run.app_link)
In the above example, we define a simple classification model by subclassing CustomModel and implementing its invoke method to return a category label for each input. We register this custom model and set up a scenario set with some example inputs and their expected labels. Finally, we call mut.run_test(...) with the scenario set to evaluate the classifier's predictions against the expected results. Okareo will automatically compute classification metrics (accuracy, F1, recall, precision) and a confusion matrix for the run. The returned TestRunItem (test_run) contains these metrics and an app_link to view the detailed results in the Okareo web app.
Retrieval
The SDK lets you build and evaluate retrieval systems—used to perform nearest neighbor search on a collection of vectors and fetch relevant context from a dataset based on a query. A typical use case is comparing sparse (like SPLADE) vs. dense (like e5) vs. hybrid embedding methods to see which performs better on your data.
You can bring your own embedding and reranker models to better fit your domain—and evaluate them directly in the SDK. Okareo also lets you fine-tune these models to get the best performance on your data.
Besides flexibility in embedding models, you can leverage Okareo's standard connectors for Qdrant, Pinecone, or any vector DB/store of your choice, or use a CustomModel to wrap a vector DB like ChromaDB (as in the example below). Okareo evaluates retrievers with Information Retrieval metrics like Accuracy@k, Precision@k, Recall@k, MAP@k, MRR@k, and NDCG@k.
import os
import chromadb
from okareo import Okareo
from okareo.model_under_test import CustomModel, ModelInvocation
from okareo_api_client.models import ScenarioSetCreate, TestRunType
okareo = Okareo(os.environ["OKAREO_API_KEY"])
# Set up vector DB
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="retrieval_test", metadata={"hnsw:space": "cosine"}, get_or_create=True)
# !!! Add documents to the ChromaDB collection before running. !!!
# Define model
class RetrievalModel(CustomModel):
# A function to convert the query results from our ChromaDB collection
# into a list of dictionaries with the document ID, score, metadata, and label
@staticmethod
def query_results_to_score(results):
parsed_ids_with_scores = []
for i in range(0, len(results['distances'][0])):
# Create a score based on cosine similarity
score = (2 - results['distances'][0][i]) / 2
parsed_ids_with_scores.append(
{
"id": results['ids'][0][i],
"score": score,
"metadata": results['documents'][0][i],
"label": f"Doc w/ ID: {results['ids'][0][i]}"
}
)
return parsed_ids_with_scores
def invoke(self, input: str) -> ModelInvocation:
results = collection.query(query_texts=[input], n_results=5)
# Return formatted query results and the model response context
return ModelInvocation(model_prediction=RetrievalModel.query_results_to_score(results), model_output_metadata={'model_data': input})
# Register model
mut = okareo.register_model(name="vectordb_retrieval_test", model=RetrievalModel(name="retrieval"))
# Create scenario set
seed_data = Okareo.seed_data_from_list([
{"input": "What is Okareo?", "result": ["doc123"]},
{"input": "How do I install Okareo?", "result": ["doc456"]}
])
scenario = okareo.create_scenario_set(ScenarioSetCreate(name="Retrieval Test", seed_data=seed_data))
# Define thresholds for the evaluation metrics
at_k_intervals = [1, 3, 5]
# Run test
test_run = mut.run_test(
scenario=scenario,
name="Retrieval Run",
test_run_type=TestRunType.INFORMATION_RETRIEVAL,
# Define the evaluation metrics to calculate
metrics_kwargs={
"accuracy_at_k": at_k_intervals ,
"precision_recall_at_k": at_k_intervals ,
"ndcg_at_k": at_k_intervals,
}
)
print(test_run.app_link)
Use 'metrics_kwargs' to customize @k intervals.
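For example, here is a variation of the run above with wider cutoffs for a larger collection, reusing only the metric keys already shown:
# Wider @k cutoffs for a larger collection (same keys as the example above)
at_k_intervals = [1, 5, 10, 20]
test_run = mut.run_test(
    scenario=scenario,
    name="Retrieval Run (wider cutoffs)",
    test_run_type=TestRunType.INFORMATION_RETRIEVAL,
    metrics_kwargs={
        "accuracy_at_k": at_k_intervals,
        "precision_recall_at_k": at_k_intervals,
        "ndcg_at_k": at_k_intervals,
    },
)
print(test_run.app_link)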
You can see additional retrieval examples here:
- Retrieval Evaluation Example: Basic retrieval evaluation notebook
- Cohere & Pinecone Retrieval Example: Retrieval evaluation using Cohere embeddings and Pinecone vector store
- Embedding Model Comparison: Compare different embedding models for retrieval tasks
Built-In Checks and Custom Checks
What are Checks? In Okareo, Checks are automatic evaluation metrics or constraints that analyze your model’s outputs. A check can score an output on some criterion or enforce that it meets certain requirements. For example, you might use checks to verify that a generated answer is factually correct, not too long, or similar enough to an expected reference answer. Each check runs after the model produces an output and returns either a numeric score or a pass/fail boolean indicating whether the output satisfies the check. This lets you evaluate or constrain model outputs programmatically as part of your test runs.
Built-in Checks and Using Checks in run_test
Okareo provides many built-in checks (metrics) that cover common evaluation criteria – from language quality (e.g. fluency, coherence) to correctness and relevance. For instance, reference_similarity is a built-in check that measures how closely the model's output matches a reference answer (useful for QA, summarization, or translation tasks with a ground-truth result). Other examples include context_consistency, does_code_compile (checks if generated code is syntactically correct), levenshtein_distance (measures edit distance to a reference), relevance/consistency (LLM-based relevance or consistency scoring), and more. You can retrieve the full list of available checks by calling okareo.get_all_checks().
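A quick way to see what is available in your project (the exact shape of the returned objects may vary by SDK version):
# List the checks available to your account/project
for check in okareo.get_all_checks():
    print(check)  # each entry describes one built-in or custom check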
To use these checks during evaluation, pass their names to the checks parameter in run_test(). The SDK will automatically calculate each specified check for every output. For example, to evaluate a generative model on a scenario with both a reference_similarity metric and a context_consistency check:
from okareo_api_client.models import TestRunType
evaluation = model_under_test.run_test(
name="My Evaluation",
scenario=my_scenario, # ScenarioSetResponse object or UUID
test_run_type=TestRunType.NL_GENERATION,
checks=['reference_similarity', 'context_consistency'] # list of check names to run
)
Each check will produce a score or pass/fail result for every test case. You can view these check results in the returned evaluation object or in the Okareo web app. For instance, the reference_similarity score might be high if the output closely matches the expected result. Using built-in checks in this way makes it easy to track your model's behavior across various dimensions without writing any custom code.
Defining Custom Checks
In addition to the off-the-shelf checks, Okareo allows you to define custom checks tailored to your specific needs. There are two primary ways to create custom checks:
- Code-based checks – you write Python code (logic, heuristics, regex, etc.) to evaluate the output deterministically.
- Model-based checks – you leverage another language model to judge the output via a prompt template (useful for subjective or complex criteria).
Both approaches use the okareo.create_or_update_check() method to register the new check with Okareo so it can be used in evaluations. Below, we walk through full examples of each method.
Code-Based Custom Checks
A CodeBasedCheck is a custom metric implemented in pure code. This is useful for straightforward or deterministic evaluations – for example, checking if the output JSON is valid, or if an answer contains a certain keyword. To create a code-based check:
- Create a Python file (e.g. my_custom_check.py) in your project (outside of a notebook environment).
- Define a class named Check in this file that inherits from okareo.checks.CodeBasedCheck.
- Implement the evaluate method in your class with the logic to score or validate outputs.
- Include any additional imports or helper functions in the same file as needed for your check logic.
The evaluate method of a CodeBasedCheck should be a @staticmethod (no self parameter) that accepts model_output, scenario_input, and scenario_result – the model's generated output, the scenario's input prompt, and the scenario's expected result (if any) – plus an optional metadata dict (covered later). The method should return either a boolean (for pass/fail checks) or a number (int or float for scoring checks). Generally, returning a boolean will treat the check as a Pass/Fail metric, while returning a numeric value will treat it as a scored metric.
Example – CodeBasedCheck: Suppose we want a custom check that passes if the model's output contains more than 10 words (a simple length check). First, create a file my_custom_check.py with the following content:
# In my_custom_check.py
from okareo.checks import CodeBasedCheck
class Check(CodeBasedCheck):
@staticmethod
def evaluate(model_output: str, scenario_input: str, scenario_result: str, metadata: dict) -> bool:
# Custom logic: check if output has more than 10 words
word_count = len(model_output.split())
return word_count > 10 # Returns True if output length > 10 words (Pass/Fail)
After defining the check class, register it with Okareo using the SDK. This will upload your custom check code to Okareo (so it can run in the evaluation service) or update it if the name already exists:
from my_custom_check import Check  # import the Check class defined above
# Register (or update) the custom code-based check with Okareo
check_long_output = okareo.create_or_update_check(
name="check_long_output", # unique name for the check
description="Check if output has more than 10 words",
check=Check() # instantiate our custom Check class
)
Here, create_or_update_check takes a name, an optional description, and an instance of our Check class. Under the hood, the SDK packages the my_custom_check.py code (including the Check.evaluate logic) and sends it to the Okareo platform. Once created, you can use "check_long_output" like any other check name in run_test(checks=[...]). For example, checks=['reference_similarity', 'check_long_output'] would apply both the built-in reference similarity metric and our custom length check to each model output, as shown below.
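Reusing the mut and scenario_set from the GenerationModel section above, the call would look like this:
# Apply the built-in reference_similarity check plus our custom length check
test_run = mut.run_test(
    scenario=scenario_set,
    name="Eval with custom length check",
    api_key=os.environ["OPENAI_API_KEY"],
    test_run_type=TestRunType.NL_GENERATION,
    checks=["reference_similarity", "check_long_output"],
)
print(test_run.app_link)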
Note: Ensure that your custom check file only contains one Check class and that all necessary helper code is included in that file. The Okareo service will execute your evaluate function in a sandboxed environment using the provided code.
Model-Based Custom Checks
A ModelBasedCheck uses a prompt and a scoring LLM to evaluate the output. This is handy for complex or subjective evaluations – e.g. “Is the answer polite?”, “Does the output correctly answer the question?” – where you can’t easily write a deterministic rule. Instead, you provide a prompt template that a judge model will complete to produce a score or verdict.
When creating a ModelBasedCheck, you specify a prompt_template and a check_type indicating the expected output format of the evaluation. The prompt template is a string that can include placeholders for the input, result, and generation:
- {model_input} -> corresponds to the model's input
- {generation} -> will be replaced with the model's output (the generation to be evaluated)
- {scenario_input} -> will be replaced with the scenario's input prompt
- {scenario_result} -> will be replaced with the scenario's expected result (if provided; often used for reference-based checks)
Your template should incorporate one or more of these placeholders and instruct the evaluating model to produce the desired judgment. The check_type tells Okareo what kind of response to expect from the prompt:
- CheckOutputType.SCORE – the prompt will yield a numeric score (e.g. a rating like 1–5 or a percentage). The evaluation model should output only a number in its response.
- CheckOutputType.PASS_FAIL – the prompt will yield a boolean outcome (True/False or Pass/Fail). The model should output a clear true/false result (not case-sensitive).
Example – ModelBasedCheck: Suppose we want to measure how many words the output contains, but instead of coding it directly, we’ll use an LLM to count. (This is a contrived example – you’d normally do this with a CodeBasedCheck, but it illustrates the pattern.) We’ll ask the LLM to output just the number of words in the given text. We define the check as follows:
from okareo.checks import ModelBasedCheck, CheckOutputType
check_word_count = okareo.create_or_update_check(
name="check_word_count",
description="Use an LLM to count number of words in output",
check=ModelBasedCheck(
prompt_template="Only output the number of words in the following text:\n{generation}",
check_type=CheckOutputType.SCORE
)
)
In this call, we pass an instance of ModelBasedCheck to create_or_update_check:
- The prompt_template string instructs the evaluator model: it will be filled with the model's output ({generation}) and asks for the number of words. We do not include {scenario_input} or {scenario_result} here because this check only cares about the generated text itself.
- We set check_type=CheckOutputType.SCORE because we expect a numeric answer (the count). If instead we wanted a yes/no style evaluation (e.g. “Does the answer contain the keyword ‘apple’?”), we would use CheckOutputType.PASS_FAIL and craft the prompt to elicit a True/False reply (see the sketch below).
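Here is what that pass/fail variant could look like, following the same pattern (the check name and prompt wording are illustrative):
# Hypothetical pass/fail check: does the output mention the keyword "apple"?
check_contains_apple = okareo.create_or_update_check(
    name="check_contains_apple",
    description="LLM judge: does the output mention the word 'apple'?",
    check=ModelBasedCheck(
        prompt_template=(
            "Answer True if the following text mentions the word 'apple', "
            "otherwise answer False. Only output True or False:\n{generation}"
        ),
        check_type=CheckOutputType.PASS_FAIL,
    ),
)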
When this check runs as part of a test, Okareo will internally call an evaluator model with the given prompt template. The placeholders are substituted with the actual scenario data for each test case. The LLM’s response (just a number in this example) is parsed and returned as the check’s result. You can then see this metric like any other check in the evaluation results. Model-based checks essentially let you plug in an LLM “judge” to grade your model’s outputs on arbitrary criteria, without writing custom code.
Using Metadata in Custom Checks
Custom checks can also access additional information through a metadata dictionary, passed directly into your evaluate method. This metadata can include details like the model's latency (milliseconds), tool calls, token usage, or other runtime information.
Here's a simplified example:
from okareo.checks import CodeBasedCheck
class Check(CodeBasedCheck):
@staticmethod
def evaluate(model_output: str, scenario_input: str, scenario_result: str, metadata: dict) -> bool:
# Check if latency is under 1000 ms
return metadata.get('latency', float('inf')) < 1000
Example of metadata:
# Example metadata passed into an evaluate method
{
"latency": 342, # The model took 342 milliseconds to generate the output
"tool_calls": [], # No external tools were called by the model (empty list)
"input_tokens": 32 # Token counts (if applicable)
}
You could, for example, incorporate the number of tool invocations into your metric (perhaps penalize outputs that required too many tool calls), or enforce performance thresholds as shown.
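For instance, here is a minimal sketch of a check that combines tool usage and latency, using the same evaluate signature as above (the thresholds are arbitrary examples):
# In a separate file, e.g. my_efficiency_check.py
from okareo.checks import CodeBasedCheck

class Check(CodeBasedCheck):
    @staticmethod
    def evaluate(model_output: str, scenario_input: str, scenario_result: str, metadata: dict) -> bool:
        # Pass only if the model stayed fast and did not lean too heavily on tools
        tool_calls = metadata.get("tool_calls") or []
        latency_ms = metadata.get("latency", float("inf"))
        return len(tool_calls) <= 2 and latency_ms < 2000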
With custom checks (whether code-based or model-based) created and registered, you can include them by name in run_test(checks=[...]) just like built-in checks. This makes it easy to extend Okareo's evaluation suite with your own domain-specific metrics and have them evaluated consistently across all your LLM tests. The Okareo platform will handle executing your check logic for each output and aggregating the results, so you can focus on interpreting the metrics and improving your LLM app.
Runtime Error Tracking and Evaluations
In addition to running structured evaluations, the Okareo SDK lets you log data from your running application for error tracking or feedback loops. If you have an app in production (or in a notebook) and want to capture each prompt/response pair, you can use the ModelUnderTest.add_data_point() method. This will send an input-output record to Okareo (associated with the model/agent you registered) so that Okareo can store it and evaluate it with Checks in the background.
For example, use add_data_point() to synchronously log a model interaction, and add_data_point_async() to send it without blocking:
# Assume `mut` is a ModelUnderTest obtained after model registration
# Log a successful prediction data point
mut.add_data_point(
input_obj="Hello, world",
result_obj="Hi there!"
) # Records an input/result pair
# Asynchronous logging (non-blocking)
mut.add_data_point_async(
input_obj="test input",
error_message="Quota exceeded",
error_code="429"
) # Queues the data point to be sent in background
Each data point can include an input_obj and result_obj (each can be a dict, list, or str), plus optional context like error_message and error_code. The asynchronous variant accepts the same fields and returns immediately (the SDK handles sending in the background). These runtime logs help track errors or evaluate model outputs in production.
Multi-Turn (Dialog) Evaluations
Okareo supports running full conversational evaluations via the MultiTurnDriver construct. This means you can evaluate not just single question-answer pairs, but interactive conversations and even agent-to-agent dialogues. This is a powerful capability that goes beyond typical one-shot prompts – you can simulate a workflow of interactions. If your system has multiple agents that talk to each other, you could create a scenario where their messages are passed through Okareo's driver/target mechanism for evaluation.
You can configure a MultiTurnDriver as part of your model definition to handle this. For instance, you might register a model with a MultiTurnDriver that wraps your target model (the agent) and uses either an LLM or custom logic to simulate the user's side of the conversation. The driver can be configured with parameters like number of turns, who speaks first, and a StopConfig to decide when to end the dialog. Once such an agent is registered (with type="driver" under the hood), calling run_test will execute the conversation workflow. This allows you to test agent behavior over multiple interactions, not just single question-answer pairs, which is essential for debugging agents that use memory, tools, or multi-step workflows.
Concept
A MultiTurnDriver is composed of:
- Driver: Simulates a user or external agent.
- Target: The model/agent under evaluation.
They alternate turns in a simulated conversation, which is run via run_test.
Setup
from okareo.model_under_test import MultiTurnDriver, GenerationModel, StopConfig
driver = MultiTurnDriver(
driver_temperature=1.0,
max_turns=3,
repeats=1,
first_turn="target",
target=GenerationModel( # or CustomMultiturnTarget
model_id="gpt-4o",
temperature=0.7,
system_prompt_template="You are a customer service agent for WebBizz...",
tools=[
{
"type": "function",
"function": {
"name": "delete_account",
"description": "Deletes the user's account",
# parameter schema here
},
}
] # optional
),
stop_check=StopConfig(
check_name="task_completed",
stop_on=True
)
)
Driver Parameters
| Parameter | Description |
|---|---|
| driver_temperature | Controls randomness of the simulated user/agent |
| max_turns | Maximum number of back-and-forth messages |
| repeats | Number of times each test row is repeated to capture variance |
| first_turn | Whether the "driver" or "target" starts the conversation |
| stop_check | Defines the stopping condition (via a check) |
Target
The Target is either:
- A GenerationModel
- A CustomMultiturnTarget (custom logic that receives the message history as input)
Note: You must define a system_prompt_template when using a MultiTurnDriver.
Creating a Scenario for Driver
Use SeedData to define different types of Driver behavior:
from okareo_api_client.models import ScenarioSetCreate, SeedData
seeds = [
SeedData(
input_="You are interacting with a customer service agent. First, ask about WebBizz...",
result="N/A",
)
]
scenario_set_create = ScenarioSetCreate(
name=f"MultiTurn Conversation",
seed_data=seeds
)
scenario = okareo.create_scenario_set(scenario_set_create)
The input_ is passed to the Driver and should instruct it how to behave during the test.
Function Calling + Tool Mocking
Your model can call tools using the OpenAI tool calling format:
tools = [
{
"type": "function",
"function": {
"name": "delete_account",
"description": "Deletes the user's account",
# parameters
},
}
]
You can instruct the Driver to mock tool results via the Driver prompt:
If you receive any function calls, respond as if the account was deleted successfully in JSON.
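For example, that instruction can be folded into the Driver scenario's input_ (reusing the SeedData pattern shown earlier):
seeds = [
    SeedData(
        input_=(
            "You are interacting with a customer service agent. Ask the agent to delete your account. "
            "If you receive any function calls, respond as if the account was deleted successfully in JSON."
        ),
        result="N/A",
    )
]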
Conversation Control & Checks
First create the check with:
from okareo.checks import CheckOutputType, ModelBasedCheck
okareo.create_or_update_check(
name="task_completion_delete_account",
description="Check if the agent confirms account deletion",
check=ModelBasedCheck( # or any Check subclass
prompt_template=(
"The task is complete if the output confirms that account deletion was successful. "
"Return True for if the task is completed, False otherwise. "
"Here is the output to check: {model_output}"
),
check_type=CheckOutputType.PASS_FAIL,
),
)
Use StopConfig to end the test when a condition is met:
stop_check=StopConfig(
check_name="task_completion_delete_account",
stop_on=True
)
For multi-turn evaluations, Checks can assess:
- Task completion
- Tone
- Accuracy
- Guideline adherence
- etc.
Once the Driver, Target, and Driver scenario are ready, you can register and run your multi-turn evaluation:
from okareo_api_client.models.test_run_type import TestRunType
multiturn_model = okareo.register_model(
name="Demo MultiTurnDriver",
model=driver,
update=True,
)
evaluation = multiturn_model.run_test(
scenario=scenario,
name="Multi-turn Demo Evaluation",
api_key=os.environ["OPENAI_API_KEY"],
test_run_type=TestRunType.NL_GENERATION
)
print(f"See results in Okareo app: {evaluation.app_link}")
End-to-End Examples: You can find end-to-end notebook examples with tools and CustomMultiturnTarget in this directory.
Scenario Generation
Okareo can generate new scenarios (variations of your test data) using LLMs. The SDK provides okareo.generate_scenarios(source_scenario, name, number_examples, generation_type), which leverages an LLM to rephrase or create new input variations from an existing scenario set. For example, using generation_type=ScenarioType.REPHRASE_INVARIANT will create paraphrased inputs that should yield the same expected result. This is useful for robustness testing – you can take a few seed Q&A pairs and expand them into dozens of semantically similar but varied questions automatically.
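A minimal sketch, assuming ScenarioType is importable from okareo_api_client.models and that source_scenario accepts the scenario ID of the set created earlier:
from okareo_api_client.models import ScenarioType  # assumption: import path may differ by SDK version

# Generate paraphrased variations of an existing scenario set
generated = okareo.generate_scenarios(
    source_scenario=scenario_set.scenario_id,  # assumption: the scenario ID of an existing set
    name="Basic Q&A - Rephrased",
    number_examples=5,
    generation_type=ScenarioType.REPHRASE_INVARIANT,
)
print(generated.app_link)  # assumption: the generated set also exposes an app_link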