Introduction to Evaluation
Okareo Evaluations helps you catch errors and evaluate large language models (LLMs), agents, and RAG pipelines in both structured test scenarios and real-world usage. It provides a unified way to collect model telemetry (inputs, outputs, errors) and run evaluations that measure performance and reliability.
Use the Python or TypeScript SDK to integrate Okareo (cloud-based or self‑hosted) into your applications. The SDK lets you register and test your AI code (including LLMs, agents, and embedding models), define scenarios for evaluation (single‑turn or multi‑turn conversations, classification, and retrieval), log real‑time datapoints from your app, and use built‑in or custom metrics (Checks) to evaluate performance.
Install and Authenticate
Installing the SDK:
- Python
- TypeScript
pip install okareo
After installing, obtain an API token from Okareo. Sign up for a free account on the Okareo web app and generate an API Token. Set this token as an environment variable so the SDK can authenticate:
export OKAREO_API_KEY="<YOUR_TOKEN>"
Alternatively, pass the key directly in code when initializing the client.
from okareo import Okareo
okareo = Okareo("<YOUR_TOKEN>")
npm install -D okareo-ts-sdk # or: yarn add -D okareo-ts-sdk
After installing, obtain an API token from Okareo. Sign up for a free account on the Okareo web app and generate an API Token. Set this token as an environment variable so the SDK can authenticate:
export OKAREO_API_KEY="<YOUR_TOKEN>"
Alternatively, pass the key directly in code when initializing the client.
import { Okareo } from "okareo-ts-sdk";
const okareo = new Okareo({ api_key: "<YOUR_TOKEN>" });
Verify
You can verify your installation and API connectivity with this simple test:
- Python
- TypeScript
import os
from okareo import Okareo
okareo = Okareo(os.environ["OKAREO_API_KEY"])
print("✅ Installation verified! Projects for this account:", okareo.get_projects())
import { Okareo } from "okareo-ts-sdk";
(async () => {
const okareo = new Okareo({ api_key: process.env.OKAREO_API_KEY! });
const projects = await okareo.getProjects();
console.log("✅ Installation verified! Projects for this account:", projects);
})();
Using the Okareo SDK
Once the Okareo client is authenticated, you can use the SDK to register models and agents, define evaluation scenarios, run tests, and log data. Below are common usage patterns with code examples.
Defining Scenarios for Evaluation
A Scenario Set in Okareo represents a dataset of test cases (inputs and expected outputs) against which to evaluate your application's task. You can create scenarios programmatically using the SDK:
Prepare seed data: Each scenario data point consists of an input (e.g. a prompt or query) and an expected result (e.g. the correct answer or ideal response). The SDK provides a SeedData model class to structure these. For convenience, you can use a Python list of dictionaries and convert it to SeedData objects using Okareo.seed_data_from_list(...). For example:
- Python
- TypeScript
import os
from okareo import Okareo
from okareo_api_client.models import ScenarioSetCreate, SeedData
okareo = Okareo(os.environ["OKAREO_API_KEY"])
seed_items = [
{"input": "Capital of France?", "result": "Paris"},
{"input": "5 + 7 =", "result": "12"},
]
seed_data = Okareo.seed_data_from_list(seed_items)
scenario_req = ScenarioSetCreate(name="My Evaluation Scenario", seed_data=seed_data)
scenario_set = okareo.create_scenario_set(scenario_req)
print(scenario_set.app_link)
import { Okareo } from "okareo-ts-sdk";
const okareo = new Okareo({ api_key: process.env.OKAREO_API_KEY! });
const seedData = [
{ input: "Capital of France?", result: "Paris" },
{ input: "5 + 7 =", result: "12" },
];
const scenarioSet = await okareo.create_scenario_set({
name: "My Evaluation Scenario",
project_id: "your-project-id",
seed_data: seedData,
});
console.log(scenarioSet.app_link);
See Data Sets to learn more about Scenarios.
Running Evaluations – GenerationModel
GenerationModel provides a standard interface to evaluate text-generation models (LLMs) in Okareo. You define the model by specifying its identifier and parameters (like temperature), register it with Okareo, create a scenario set of test prompts and expected results, then run an evaluation:
- Python
- TypeScript
import os
from okareo import Okareo
from okareo.model_under_test import GenerationModel
from okareo_api_client.models import ScenarioSetCreate, TestRunType
okareo = Okareo(os.environ["OKAREO_API_KEY"])
model = GenerationModel(
model_id="gpt-4o",
temperature=0.7,
system_prompt_template="{input}",
)
mut = okareo.register_model(name="LLM under evaluation", model=model)
seed_data = Okareo.seed_data_from_list([
{"input": "What is 2+2?", "result": "4"},
{"input": "Hello", "result": "Hi"},
])
scenario = okareo.create_scenario_set(ScenarioSetCreate(name="Basic Q&A", seed_data=seed_data))
test_run = mut.run_test(
scenario=scenario,
name="Example Eval Run",
api_key=os.environ["OPENAI_API_KEY"],
test_run_type=TestRunType.NL_GENERATION,
checks=["reference_similarity"],
)
print(test_run.app_link)
import { Okareo, TestRunType, GenerationModel } from "okareo-ts-sdk";
const okareo = new Okareo({ api_key: process.env.OKAREO_API_KEY! });
const model = await okareo.register_model({
name: "LLM under evaluation",
project_id: "your-project-id",
models: [
{
type: "generation",
model_id: "gpt-4o",
temperature: 0.7,
system_prompt_template: "{input}",
} as GenerationModel,
],
});
const scenario = await okareo.create_scenario_set({
name: "Basic Q&A",
project_id: "your-project-id",
seed_data: [
{ input: "What is 2+2?", result: "4" },
{ input: "Hello", result: "Hi" },
],
});
const testRun = await model.run_test({
model_api_key: process.env.OPENAI_API_KEY!,
project_id: "your-project-id",
name: "Example Eval Run",
scenario,
type: TestRunType.NL_GENERATION,
checks: ["reference_similarity"],
});
console.log(testRun.app_link);
In the above example, we define a generation model with a model_id and temperature (you can also set prompt templates if needed). We register it via register_model() – providing a name and the GenerationModel object – which returns a ModelUnderTest handle tied to that model. We then prepare a Scenario Set of two simple Q&A pairs. The helper Okareo.seed_data_from_list() converts a list of {"input": ..., "result": ...} mappings into the required SeedData format. Using okareo.create_scenario_set(...), we upload this scenario set to the Okareo platform. Finally, calling mut.run_test(...) executes the model on each scenario in the set and returns a TestRunItem result (which includes metrics and outcomes). This allows you to automatically evaluate the model's responses against expected results.
Obtaining results: If run_test is called synchronously it will usually wait for completion (the HTTP call returns when the evaluation is done). You can also log in to the Okareo web app – each test run is visible there with detailed reports (the TestRunItem.app_link field gives a direct URL to view the run in the app).
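For example, a minimal sketch of inspecting the returned object from the run above (app_link comes straight from the example; the model_metrics attribute name is an assumption, so check the TestRunItem model in your SDK version):
# `test_run` is the TestRunItem returned by mut.run_test(...) above
print(test_run.name)           # name given to this evaluation run
print(test_run.app_link)       # direct URL to the run in the Okareo web app
print(test_run.model_metrics)  # aggregate metrics; attribute name assumed, verify in your SDK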
Okareo supports many providers through GenerationModel (OpenAI, Cohere, Anthropic, etc.). The register_model() call returns a model handle we'll use for running tests. (Model names must be unique within your project.)
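Pointing the same evaluation at a different provider is mostly a matter of changing the model identifier and the API key passed to run_test. A sketch, assuming the provider-specific model_id and environment variable below (both illustrative):
# Reuses okareo, scenario, os, and TestRunType from the example above
from okareo.model_under_test import GenerationModel
claude_model = GenerationModel(
    model_id="claude-3-5-sonnet-latest",  # illustrative identifier; use your provider's model name
    temperature=0.7,
    system_prompt_template="{input}",
)
claude_mut = okareo.register_model(name="Claude under evaluation", model=claude_model)
test_run = claude_mut.run_test(
    scenario=scenario,
    name="Claude Eval Run",
    api_key=os.environ["ANTHROPIC_API_KEY"],  # the provider's key, not your Okareo key
    test_run_type=TestRunType.NL_GENERATION,
    checks=["reference_similarity"],
)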
See Generation to learn more about Generation Evaluations.
Running Evaluations – CustomModel and ModelInvocation
Okareo also supports evaluating custom models – models not covered by built-in providers – by subclassing CustomModel. You implement the invoke() method to call your model and return a ModelInvocation result. This lets you integrate any arbitrary model (e.g. from Hugging Face) or logic into the Okareo evaluation framework:
- Python
- TypeScript
from okareo.model_under_test import CustomModel, ModelInvocation
from okareo_api_client.models import TestRunType
# Note: reuses the `okareo` client and `scenario` set created in the previous example
class MyModel(CustomModel):
def __init__(self):
super().__init__(name="MyModel")
def invoke(self, input_value):
return ModelInvocation(model_prediction=input_value, model_input=input_value)
mut = okareo.register_model(name="Custom Model", model=MyModel())
test_run = mut.run_test(
scenario=scenario,
name="Custom Test",
test_run_type=TestRunType.NL_GENERATION,
checks=["reference_similarity"],
)
import { Okareo, CustomModel, ModelInvocation, TestRunType } from "okareo-ts-sdk";
const okareo = new Okareo({ api_key: process.env.OKAREO_API_KEY! });
const customModel = await okareo.register_model({
name: "Custom Model",
project_id: "your-project-id",
models: {
type: "custom",
invoke: (input: string): ModelInvocation => ({
model_prediction: input,
model_input: input,
}),
} as CustomModel,
});
const testRun = await customModel.run_test({
project_id: "your-project-id",
name: "Custom Test",
scenario,
type: TestRunType.NL_GENERATION,
checks: ["reference_similarity"],
});
In this example, MyModel inherits from CustomModel and implements invoke(self, input_value). Inside invoke, we call our actual model logic – here we simply echo the input – and return a ModelInvocation object containing the prediction and any relevant metadata. The Okareo SDK uses this to handle custom model outputs uniformly. We then register MyModel with register_model() (just like a built-in model) and run a test on a given scenario set. The process for creating or reusing scenario sets and executing run_test is identical, making custom models first-class citizens in the evaluation workflow. This way, you can evaluate proprietary models or algorithms with the same scenario-based approach and metrics as other models.
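As a slightly more realistic sketch, here is what wrapping a Hugging Face model could look like; the transformers pipeline and the model name are assumptions used for illustration and are not part of the Okareo SDK:
from transformers import pipeline  # assumption: transformers is installed
from okareo.model_under_test import CustomModel, ModelInvocation
class HFSummarizer(CustomModel):
    def __init__(self):
        super().__init__(name="HFSummarizer")
        # Illustrative model choice; any local inference call works the same way
        self.pipe = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
    def invoke(self, input_value: str) -> ModelInvocation:
        summary = self.pipe(input_value)[0]["summary_text"]
        return ModelInvocation(
            model_prediction=summary,
            model_input=input_value,
            model_output_metadata={"hf_model": "sshleifer/distilbart-cnn-12-6"},
        )
# Register and run exactly like any other model (reuses the `okareo` client from above)
hf_mut = okareo.register_model(name="HF Summarizer", model=HFSummarizer())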
Retrieval
The SDK lets you build and evaluate retrieval systems, which perform nearest-neighbor search over a collection of vectors to fetch relevant context from a dataset based on a query. A typical use case is comparing sparse (like SPLADE) vs. dense (like e5) vs. hybrid embedding methods to see which performs better on your data.
You can bring your own embedding and reranker models to better fit your domain—and evaluate them directly in the SDK. Okareo also lets you fine-tune these models to get the best performance on your data.
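For example, a minimal sketch of plugging a dense embedding model of your choice into the ChromaDB collection used in the example below; the SentenceTransformer helper and the model name are assumptions, and any embedding function callable from Python works the same way:
import chromadb
from chromadb.utils import embedding_functions
# Illustrative embedding model; swap in whichever model you want to compare
e5_embed = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="intfloat/e5-small-v2"
)
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(
    name="retrieval_e5",
    embedding_function=e5_embed,
    metadata={"hnsw:space": "cosine"},
    get_or_create=True,
)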
Besides flexibility in embedding models, you can leverage Okareo's standard connectors for Qdrant, Pinecone, or any vector DB/store of your choice. Use a CustomModel to wrap a vector DB like ChromaDB. Okareo evaluates retrievers with Information Retrieval metrics like Accuracy@k, Precision@k, Recall@k, MAP@k, MRR@k, and NDCG@k.
- Python
- TypeScript
import os
import chromadb
from okareo import Okareo
from okareo.model_under_test import CustomModel, ModelInvocation
from okareo_api_client.models import ScenarioSetCreate, TestRunType
okareo = Okareo(os.environ["OKAREO_API_KEY"])
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(
name="retrieval_test",
metadata={"hnsw:space": "cosine"},
get_or_create=True,
)
# !!! Add documents to collection before running !!!
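# For example (illustrative documents whose IDs match the scenario expected results below):
collection.add(
    documents=[
        "Okareo is a platform for evaluating LLMs, agents, and RAG systems.",
        "Install the Okareo SDK with: pip install okareo",
    ],
    ids=["doc123", "doc456"],
)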
class RetrievalModel(CustomModel):
@staticmethod
def to_score(results):
parsed = []
for i in range(len(results["distances"][0])):
score = (2 - results["distances"][0][i]) / 2
parsed.append({
"id": results["ids"][0][i],
"score": score,
"metadata": results["documents"][0][i],
"label": f"Doc w/ ID: {results['ids'][0][i]}",
})
return parsed
def invoke(self, input: str):
results = collection.query(query_texts=[input], n_results=5)
return ModelInvocation(
model_prediction=RetrievalModel.to_score(results),
model_output_metadata={"model_data": input},
)
mut = okareo.register_model(name="vectordb_retrieval_test", model=RetrievalModel(name="retrieval"))
seed_data = Okareo.seed_data_from_list([
{"input": "What is Okareo?", "result": ["doc123"]},
{"input": "How do I install Okareo?", "result": ["doc456"]},
])
scenario = okareo.create_scenario_set(ScenarioSetCreate(name="Retrieval Test", seed_data=seed_data))
at_k = [1, 3, 5]
test_run = mut.run_test(
scenario=scenario,
name="Retrieval Run",
test_run_type=TestRunType.INFORMATION_RETRIEVAL,
metrics_kwargs={
"accuracy_at_k": at_k,
"precision_recall_at_k": at_k,
"ndcg_at_k": at_k,
},
)
print(test_run.app_link)
import { ChromaClient } from "chromadb";
import { Okareo, CustomModel, ModelInvocation, TestRunType } from "okareo-ts-sdk";
const okareo = new Okareo({ api_key: process.env.OKAREO_API_KEY! });
const chromaClient = new ChromaClient();
const collection = await chromaClient.getOrCreateCollection({
  name: "retrieval_test",
  metadata: { "hnsw:space": "cosine" },
});
// !!! Add documents to the collection before running !!!
class RetrievalModel {
  readonly type = "custom" as const;
  async invoke(input: string): Promise<ModelInvocation> {
    const results = await collection.query({ queryTexts: [input], nResults: 5 });
    const scored = results.ids[0].map((id: string, i: number) => ({
      id,
      score: (2 - results.distances![0][i]) / 2,
      metadata: results.documents[0][i],
      label: `Doc w/ ID: ${id}`,
    }));
    return {
      model_prediction: scored,
      model_output_metadata: { model_data: input },
    };
  }
}
const mut = await okareo.register_model({
name: "vectordb_retrieval_test",
project_id: "your-project-id",
models: new RetrievalModel() as unknown as CustomModel,
});
const seedData = [
{ input: "What is Okareo?", result: ["doc123"] },
{ input: "How do I install Okareo?", result: ["doc456"] },
];
const scenario = await okareo.create_scenario_set({
name: "Retrieval Test",
project_id: "your-project-id",
seed_data: seedData,
});
const atK = [1, 3, 5];
const testRun = await mut.run_test({
project_id: "your-project-id",
name: "Retrieval Run",
scenario,
type: TestRunType.INFORMATION_RETRIEVAL,
metrics_kwargs: {
accuracy_at_k: atK,
precision_recall_at_k: atK,
ndcg_at_k: atK,
},
});
console.log(testRun.app_link);
Use metrics_kwargs to customize the @k intervals.
You can see additional retrieval examples here:
- Retrieval Evaluation Example: Basic retrieval evaluation notebook
- Cohere & Pinecone Retrieval Example: Retrieval evaluation using Cohere embeddings and Pinecone vector store
- Embedding Model Comparison: Compare different embedding models for retrieval tasks
See Retrieval to learn more about Retrieval Evaluations.
Classification
The Okareo SDK makes it easy to set up and evaluate classification tasks—where the goal is to assign input text to one of several predefined categories. A common use case is intent detection in Retrieval-Augmented Generation (RAG) systems, where identifying the user's intent helps retrieve more relevant context and improves response quality.
To define a classification task, you provide instructions that describe the input format, output format, and list of valid categories—like "Support," "Returns," "Membership," or "Sustainability." You can run these tasks with pre-trained models or fine-tune your own using the SDK to get better results on your specific data. This capability allows you to create highly specialized classification models tailored to specific application needs.
The SDK also includes built-in tools to evaluate classification accuracy and performance directly in the Okareo platform.
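If you would rather run the task on a hosted LLM than on your own code, a minimal sketch looks like this (the prompt wording and the gpt-4o identifier are illustrative):
from okareo.model_under_test import GenerationModel
# Instructions describing the output format and the valid categories
intent_prompt = (
    "Classify the user's message into exactly one of: "
    "Support, Returns, Membership, Sustainability.\n"
    "Message: {input}\n"
    "Respond with the category name only."
)
intent_model = GenerationModel(
    model_id="gpt-4o",  # illustrative model choice
    temperature=0,
    system_prompt_template=intent_prompt,
)
# Register with okareo.register_model(...) and evaluate with
# TestRunType.MULTI_CLASS_CLASSIFICATION, as in the CustomModel example below.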
Here's a minimal example to get started:
- Python
- TypeScript
import os
from okareo import Okareo
from okareo.model_under_test import CustomModel, ModelInvocation
from okareo_api_client.models import ScenarioSetCreate, TestRunType
okareo = Okareo(os.environ["OKAREO_API_KEY"])
class MyClassifier(CustomModel):
def invoke(self, input_text: str):
input_lower = input_text.lower()
if "return" in input_lower:
label = "returns"
elif "how much" in input_lower:
label = "pricing"
else:
label = "complaints"
return ModelInvocation(
model_prediction=label,
model_input=input_text,
model_output_metadata={
"extracted_entities": {"return": True, "amount": 23}
},
)
mut = okareo.register_model(name="Intent Classifier", model=MyClassifier(name="ClassifierModel"))
seed_data = Okareo.seed_data_from_list([
{"input": "I want to send this product back", "result": "returns"},
{"input": "How much does this product cost?", "result": "pricing"},
])
scenario_set = okareo.create_scenario_set(ScenarioSetCreate(name="Intent Test Set", seed_data=seed_data))
test_run = mut.run_test(
scenario=scenario_set,
name="Classification Test Run",
test_run_type=TestRunType.MULTI_CLASS_CLASSIFICATION,
)
print(test_run.app_link)
import { Okareo, CustomModel, ModelInvocation, TestRunType } from "okareo-ts-sdk";
const okareo = new Okareo({ api_key: process.env.OKAREO_API_KEY! });
const classifier = await okareo.register_model({
name: "Intent Classifier",
project_id: "your-project-id",
models: {
type: "custom",
invoke: (input: string): ModelInvocation => {
const text = input.toLowerCase();
let label = "complaints";
if (text.includes("return")) label = "returns";
else if (text.includes("how much")) label = "pricing";
return {
model_prediction: label,
model_input: input,
model_output_metadata: {
extracted_entities: { return: true, amount: 23 },
},
};
},
} as CustomModel,
});
const seedData = [
{ input: "I want to send this product back", result: "returns" },
{ input: "How much does this product cost?", result: "pricing" },
];
const scenarioSet = await okareo.create_scenario_set({
name: "Intent Test Set",
project_id: "your-project-id",
seed_data: seedData,
});
const testRun = await classifier.run_test({
project_id: "your-project-id",
name: "Classification Test Run",
scenario: scenarioSet,
type: TestRunType.MULTI_CLASS_CLASSIFICATION,
});
console.log(testRun.app_link);
In the above example, we define a simple classification model by subclassing CustomModel and implementing its invoke method to return a category label for each input. We register this custom model and set up a scenario set with some example inputs and their expected labels. Finally, we call mut.run_test(...) with the scenario set to evaluate the classifier's predictions against the expected results. Okareo automatically computes classification metrics (accuracy, F1, recall, precision) and a confusion matrix for the run. The returned TestRunItem (test_run) contains these metrics and an app_link to view the detailed results in the Okareo web app.
See Classification to learn more about Classification Evaluations.