Introduction to Evaluation

Okareo Evaluations helps you catch errors and evaluate large language models (LLMs), agents, and RAG pipelines in both structured test scenarios and real-world usage. It provides a unified way to collect model telemetry (inputs, outputs, errors) and run evaluations that measure performance and reliability.

Use the Python or TypeScript SDK to integrate Okareo (cloud-based or self‑hosted) into your applications. The SDK lets you register and test your AI code (including LLMs, agents, and embedding models), define scenarios for evaluation (single‑turn or multi‑turn conversations, classification, and retrieval), log real‑time datapoints from your app, and use built‑in or custom metrics (Checks) to evaluate performance.

Install and Authenticate

Installing the SDK:

pip install okareo

After installing, obtain an API token: sign up for a free account on the Okareo web app and generate a token there. Set the token as an environment variable so the SDK can authenticate:

export OKAREO_API_KEY="<YOUR_TOKEN>"

Alternatively, pass the key directly in code when initializing the client.

from okareo import Okareo

okareo = Okareo("<YOUR_TOKEN>")

Verify

You can verify your installation and API connectivity with this simple test:

import os
from okareo import Okareo

okareo = Okareo(os.environ["OKAREO_API_KEY"])
print("✅ Installation verified! Projects for this account:", okareo.get_projects())

Using the Okareo SDK

Once the Okareo client is authenticated, you can use the SDK to register models/agents, define evaluation scenarios, run tests, and log data. Below are common usage patterns with code examples.

Defining Scenarios for Evaluation

A Scenario Set in Okareo represents a dataset or set of test cases (inputs and expected outputs) against which your application's task is evaluated. You can create scenarios programmatically using the SDK:

Prepare seed data: Each scenario data point consists of an input (e.g. a prompt or query) and an expected result (e.g. the correct answer or ideal response). The SDK provides a SeedData model class to structure these. For convenience, you can use a Python list of dictionaries and convert it to SeedData objects using Okareo.seed_data_from_list(...). For example:

import os

from okareo import Okareo
from okareo_api_client.models import ScenarioSetCreate, SeedData

okareo = Okareo(os.environ["OKAREO_API_KEY"])

seed_items = [
    {"input": "Capital of France?", "result": "Paris"},
    {"input": "5 + 7 =", "result": "12"},
]
seed_data = Okareo.seed_data_from_list(seed_items)

scenario_req = ScenarioSetCreate(name="My Evaluation Scenario", seed_data=seed_data)
scenario_set = okareo.create_scenario_set(scenario_req)
print(scenario_set.app_link)
info

See Data Sets to learn more about Scenarios.

Running Evaluations – GenerationModel

GenerationModel provides a standard interface to evaluate text-generation models (LLMs) in Okareo. You define the model by specifying its identifier and parameters (like temperature), register it with Okareo, create a scenario set of test prompts and expected results, then run an evaluation:

import os
from okareo import Okareo
from okareo.model_under_test import GenerationModel
from okareo_api_client.models import ScenarioSetCreate, TestRunType

okareo = Okareo(os.environ["OKAREO_API_KEY"])

model = GenerationModel(
    model_id="gpt-4o",
    temperature=0.7,
    system_prompt_template="{input}",
)
mut = okareo.register_model(name="LLM under evaluation", model=model)

seed_data = Okareo.seed_data_from_list([
    {"input": "What is 2+2?", "result": "4"},
    {"input": "Hello", "result": "Hi"},
])
scenario = okareo.create_scenario_set(ScenarioSetCreate(name="Basic Q&A", seed_data=seed_data))

test_run = mut.run_test(
    scenario=scenario,
    name="Example Eval Run",
    api_key=os.environ["OPENAI_API_KEY"],
    test_run_type=TestRunType.NL_GENERATION,
    checks=["reference_similarity"],
)
print(test_run.app_link)

In the above example, we define a generation model with a model_id and temperature (you can also set prompt templates if needed). We register it via register_model() – providing a name and the GenerationModel object – which returns a ModelUnderTest handle tied to that model. We then prepare a Scenario Set of two simple Q&A pairs. The helper Okareo.seed_data_from_list() converts a list of {"input": ..., "result": ...} mappings into the required SeedData format. Using okareo.create_scenario_set(...), we upload this scenario set to the Okareo platform. Finally, calling mut.run_test(...) executes the model on each scenario in the set and returns a TestRunItem result (which includes metrics and outcomes). This allows you to automatically evaluate the model's responses against expected results.

Obtaining results: run_test runs synchronously and usually waits for completion, so the call returns once the evaluation is done. You can also log in to the Okareo web app, where each test run is visible with detailed reports (the TestRunItem.app_link field gives a direct URL to view the run in the app).

Okareo supports many providers through GenerationModel (OpenAI, Cohere, Anthropic, etc.). The register_model() call returns a model handle we'll use for running tests. (Model names must be unique within your project.)
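
For instance, a model from another provider can be registered through the same interface (a minimal sketch; the model_id and names below are illustrative, and running an evaluation against it would require that provider's API key):

anthropic_model = GenerationModel(
    model_id="claude-3-5-sonnet-20240620",  # illustrative Anthropic model_id
    temperature=0.2,
    system_prompt_template="{input}",
)
anthropic_mut = okareo.register_model(name="Anthropic LLM under evaluation", model=anthropic_model)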

info

See Generation to learn more about Generation Evaluations.

Running Evaluations – CustomModel and ModelInvocation

Okareo also supports evaluating custom models – models not covered by the built-in providers – by subclassing CustomModel. You implement the invoke() method to call your model and return a ModelInvocation result. This lets you integrate any arbitrary model (e.g. from Hugging Face) or logic into the Okareo evaluation framework:

from okareo.model_under_test import CustomModel, ModelInvocation
from okareo_api_client.models import TestRunType

class MyModel(CustomModel):
    def __init__(self):
        super().__init__(name="MyModel")

    def invoke(self, input_value):
        # Call your own model or logic here; this example simply echoes the input.
        return ModelInvocation(model_prediction=input_value, model_input=input_value)

# Reuses the `okareo` client and `scenario` set created in the previous example.
mut = okareo.register_model(name="Custom Model", model=MyModel())

test_run = mut.run_test(
    scenario=scenario,
    name="Custom Test",
    test_run_type=TestRunType.NL_GENERATION,
    checks=["reference_similarity"],
)

In this example, MyModel inherits from CustomModel and implements invoke(self, input_value). Inside invoke, we call our actual model logic – here we simply echo the input – and return a ModelInvocation object containing the prediction and any relevant metadata. The Okareo SDK uses this to handle custom model outputs uniformly. We then register MyModel with register_model() (just like a built-in model) and run a test on a given scenario set. The process for creating or reusing scenario sets and executing run_test is identical, making custom models first-class citizens in the evaluation workflow. This way, you can evaluate proprietary models or algorithms with the same scenario-based approach and metrics as other models.
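
As a slightly more realistic sketch, a CustomModel can wrap a Hugging Face pipeline. This assumes the transformers package is installed; the pipeline task and model name are illustrative and not an official Okareo integration:

from transformers import pipeline

from okareo.model_under_test import CustomModel, ModelInvocation

class HFSummarizer(CustomModel):
    def __init__(self):
        super().__init__(name="HFSummarizer")
        # Illustrative Hugging Face summarization pipeline.
        self.summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

    def invoke(self, input_value: str):
        summary = self.summarizer(input_value)[0]["summary_text"]
        return ModelInvocation(
            model_prediction=summary,
            model_input=input_value,
            model_output_metadata={"provider": "huggingface"},
        )

hf_mut = okareo.register_model(name="HF Summarizer", model=HFSummarizer())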

Retrieval

The SDK lets you build and evaluate retrieval systems, which perform nearest-neighbor search over a collection of vectors to fetch relevant context from a dataset for a given query. A typical use case is comparing sparse (like SPLADE), dense (like e5), and hybrid embedding methods to see which performs best on your data.

You can bring your own embedding and reranker models to better fit your domain—and evaluate them directly in the SDK. Okareo also lets you fine-tune these models to get the best performance on your data.

Besides flexibility in embedding models, you can leverage Okareo's standard connectors for Qdrant, Pinecone, or any vector DB/store of your choice, or use a CustomModel to wrap a vector DB such as ChromaDB, as in the example below. Okareo evaluates retrievers with information retrieval metrics such as Accuracy@k, Precision@k, Recall@k, MAP@k, MRR@k, and NDCG@k.

import os

import chromadb
from okareo import Okareo
from okareo.model_under_test import CustomModel, ModelInvocation
from okareo_api_client.models import ScenarioSetCreate, TestRunType

okareo = Okareo(os.environ["OKAREO_API_KEY"])

chroma_client = chromadb.Client()
collection = chroma_client.create_collection(
    name="retrieval_test",
    metadata={"hnsw:space": "cosine"},
    get_or_create=True,
)

# !!! Add documents to collection before running !!!
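
# For example (illustrative data only), you could seed the collection so that the
# document IDs line up with the expected results in the scenario set below:
collection.add(
    documents=[
        "Okareo is an evaluation platform for LLMs, agents, and RAG.",
        "Install the Okareo SDK with `pip install okareo`.",
    ],
    ids=["doc123", "doc456"],
)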

class RetrievalModel(CustomModel):
    @staticmethod
    def to_score(results):
        # Convert cosine distances (0..2) into similarity scores in [0, 1].
        parsed = []
        for i in range(len(results["distances"][0])):
            score = (2 - results["distances"][0][i]) / 2
            parsed.append({
                "id": results["ids"][0][i],
                "score": score,
                "metadata": results["documents"][0][i],
                "label": f"Doc w/ ID: {results['ids'][0][i]}",
            })
        return parsed

    def invoke(self, input: str):
        results = collection.query(query_texts=[input], n_results=5)
        return ModelInvocation(
            model_prediction=RetrievalModel.to_score(results),
            model_output_metadata={"model_data": input},
        )

mut = okareo.register_model(name="vectordb_retrieval_test", model=RetrievalModel(name="retrieval"))

seed_data = Okareo.seed_data_from_list([
    {"input": "What is Okareo?", "result": ["doc123"]},
    {"input": "How do I install Okareo?", "result": ["doc456"]},
])
scenario = okareo.create_scenario_set(ScenarioSetCreate(name="Retrieval Test", seed_data=seed_data))

at_k = [1, 3, 5]

test_run = mut.run_test(
    scenario=scenario,
    name="Retrieval Run",
    test_run_type=TestRunType.INFORMATION_RETRIEVAL,
    metrics_kwargs={
        "accuracy_at_k": at_k,
        "precision_recall_at_k": at_k,
        "ndcg_at_k": at_k,
    },
)
print(test_run.app_link)

Use metrics_kwargs to customize @k intervals.
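
For instance, to report accuracy at a different set of cutoffs, pass another list of k values (this reuses the metric key from the run above; other keys follow the same pattern):

test_run = mut.run_test(
    scenario=scenario,
    name="Retrieval Run (custom k)",
    test_run_type=TestRunType.INFORMATION_RETRIEVAL,
    metrics_kwargs={"accuracy_at_k": [1, 2, 4, 8]},
)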

info

See Retrieval to learn more about Retrieval Evaluations.

Classification

The Okareo Python SDK makes it easy to set up and evaluate classification tasks—where the goal is to assign input text to one of several predefined categories. A common use case is intent detection in Retrieval-Augmented Generation (RAG) systems, where identifying the user's intent helps retrieve more relevant context and improves response quality.

To define a classification task, you provide instructions that describe the input format, the output format, and the list of valid categories, such as "Support," "Returns," "Membership," or "Sustainability." You can run these tasks with pre-trained models or fine-tune your own using the SDK to get better results on your specific data. This lets you create highly specialized classification models tailored to your application's needs.
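
One way to express such instructions is as a system prompt on a GenerationModel (a minimal sketch; the prompt wording and model_id are illustrative assumptions, and this reuses the `okareo` client from earlier):

from okareo.model_under_test import GenerationModel

intent_prompt = (
    "You are an intent classifier. Reply with exactly one of: "
    "Support, Returns, Membership, Sustainability.\n"
    "Customer message: {input}"
)

llm_classifier = GenerationModel(
    model_id="gpt-4o-mini",  # illustrative model_id
    temperature=0,
    system_prompt_template=intent_prompt,
)
llm_classifier_mut = okareo.register_model(name="Prompted Intent Classifier", model=llm_classifier)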

The SDK also includes built-in tools to evaluate classification accuracy and performance directly in the Okareo platform.

Here's a minimal example to get started:

import os
from okareo import Okareo
from okareo.model_under_test import CustomModel, ModelInvocation
from okareo_api_client.models import ScenarioSetCreate, TestRunType

okareo = Okareo(os.environ["OKAREO_API_KEY"])

class MyClassifier(CustomModel):
    def invoke(self, input_text: str):
        # Simple keyword-based routing; replace with your real classifier logic.
        input_lower = input_text.lower()
        if "return" in input_lower:
            label = "returns"
        elif "how much" in input_lower:
            label = "pricing"
        else:
            label = "complaints"

        return ModelInvocation(
            model_prediction=label,
            model_input=input_text,
            model_output_metadata={
                "extracted_entities": {"return": True, "amount": 23}
            },
        )

mut = okareo.register_model(name="Intent Classifier", model=MyClassifier(name="ClassifierModel"))

seed_data = Okareo.seed_data_from_list([
    {"input": "I want to send this product back", "result": "returns"},
    {"input": "How much does this product cost?", "result": "pricing"},
])
scenario_set = okareo.create_scenario_set(ScenarioSetCreate(name="Intent Test Set", seed_data=seed_data))

test_run = mut.run_test(
    scenario=scenario_set,
    name="Classification Test Run",
    test_run_type=TestRunType.MULTI_CLASS_CLASSIFICATION,
)
print(test_run.app_link)

In the above example, we define a simple classification model by subclassing CustomModel and implementing its invoke method to return a category label for each input. We register this custom model and set up a scenario set with some example inputs and their expected labels. Finally, we call mut.run_test(...) with the scenario set to evaluate the classifier's predictions against the expected results. Okareo automatically computes classification metrics (accuracy, F1, recall, precision) and a confusion matrix for the run. The returned TestRunItem (test_run) contains these metrics and an app_link to view the detailed results in the Okareo web app.

info

See Classification to learn more about Classification Evaluations.