
Typescript SDK

Okareo has a rich set of APIs that you can explore through the API Guide. This SDK provides access to all of the Okareo API endpoints through the OpenAPI spec. It also provides convenience functions that make testing and development with the Okareo platform faster.

In addition to making model baseline evaluation available in development, you can use this SDK to drive automation such as in CI/CD or elsewhere.

The Typescript library is transpiled to JavaScript. As a result, this SDK can be used in any common JavaScript project.

tip

The SDK requires an API Token. Refer to the Okareo API Key guide for more information.
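As a minimal sketch of guarding against a missing token before constructing the client, the helper below fails fast when the variable is absent. The name `requireEnv` is hypothetical and not part of the SDK; only the `OKAREO_API_KEY` variable name comes from the examples on this page.

```typescript
// Hypothetical helper (not part of okareo-ts-sdk): fail fast when a required
// environment variable such as OKAREO_API_KEY is missing or empty.
type Env = Record<string, string | undefined>;

function requireEnv(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// In a Node project you would typically pass process.env:
// const OKAREO_API_KEY = requireEnv(process.env, "OKAREO_API_KEY");
```

Failing at startup tends to give a clearer error than an authentication failure on the first API call.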

Overview

Automating Okareo through Typescript can be done in multiple ways.

  1. Okareo CLI - Using the Okareo CLI directly will give you the ability to write Typescript/Javascript while keeping Okareo independent from the rest of your project. Refer to the Okareo SDK/CLI pages to learn more.
  2. Unit Testing - We find that models are usually part of a larger application context. When this is the case, it is beneficial to run model evaluations and scenario expansion as part of your general CI/CD process.

The Okareo cookbooks on GitHub (okareo-cookbook) provide examples you can build from, including using the CLI directly and driving Okareo from Jest.

SDK Installation

npm install -D okareo-ts-sdk

Using the Okareo Typescript SDK

Jest: Hello Projects!

The following Jest example creates an Okareo instance, requests a list of projects, and then verifies that more than zero projects are returned.

import { Okareo } from 'okareo-ts-sdk';

const OKAREO_API_KEY = process.env.OKAREO_API_KEY;

describe('Example', () => {
  test('Get All Projects', async () => {
    const okareo = new Okareo({ api_key: OKAREO_API_KEY });
    const projects: any[] = await okareo.getProjects();
    // Passes only when at least one project is returned
    expect(projects.length).toBeGreaterThan(0);
  });
});

AI/LLM Evaluation Workflow

The following script synthetically transforms a set of direct requests into passive questions and then evaluates the core_app.getIntentContextTemplate(user, chat_history) context through OpenAI to determine if the actual intent is maintained. The number of synthetic examples created is 3 times the number of rows in the DIRECTED_INTENT data passed in.

import {
  Okareo,
  OpenAIModel,
  RunTestProps,
  ClassificationReporter,
  ScenarioType,
  TestRunType,
} from 'okareo-ts-sdk';

const OKAREO_API_KEY = process.env.OKAREO_API_KEY;
// Assumed to come from the environment as well
const OPENAI_API_KEY = process.env.OPENAI_API_KEY;

// project_id, DIRECTED_INTENT, core_app, user, and chat_history are assumed
// to be defined elsewhere in your application.

const main = async () => {
  try {
    const okareo = new Okareo({ api_key: OKAREO_API_KEY });

    // Generate 3 passive-question variants for each seed row
    const sData: any = await okareo.create_scenario_set({
      name: "Detect Passive Intent",
      project_id: project_id,
      number_examples: 3,
      generation_type: ScenarioType.TEXT_REVERSE_QUESTION,
      seed_data: DIRECTED_INTENT
    });

    const model_under_test = await okareo.register_model({
      name: "User Chat Intent - 3.5 Turbo",
      tags: ["TS-SDK", "Testing"],
      project_id: project_id,
      models: {
        type: "openai",
        model_id: "gpt-3.5-turbo",
        temperature: 0.5,
        system_prompt_template: core_app.getIntentContextTemplate(user, chat_history),
        user_prompt_template: `{scenario_input}`
      } as OpenAIModel
    });

    const eval_run: any = await model_under_test.run_test({
      name: "TS-SDK Classification",
      tags: ["Classification", "BUILD_ID"],
      model_api_key: OPENAI_API_KEY,
      project_id: project_id,
      scenario_id: sData.scenario_id,
      calculate_metrics: true,
      type: TestRunType.MULTI_CLASS_CLASSIFICATION,
    } as RunTestProps);

    const reporter = new ClassificationReporter({
      eval_run,
      error_max: 2, // allows for up to 2 errors
      metrics_min: {
        precision: 0.95,
        recall: 0.9,
        f1: 0.9,
        accuracy: 0.95
      },
    });
    reporter.log(); // logs a table to the console output with the report results

  } catch (error) {
    console.error(error);
  }
}

main();

Typescript SDK and Okareo API

The Okareo Typescript SDK is a set of convenience functions and wrappers for the Okareo REST API.

warning

Reporters are only supported in Typescript.
If you are interested in Python support, please let us know.

Class Okareo

create_or_update_check

This uploads or updates a check with the specified name. If the name for the check exists already and the check name is not shared with a predefined Okareo check, then that check will be overwritten. Returns a detailed check response object.

There are two types of checks - Code (Deterministic) and Behavioral (Model Judge). Code based checks are very fast and entirely predictable. They are code. Behavioral checks pass judgement based on inference. Behavioral checks are slower and can be less predictable. However, they are occasionally the best way to express behavioral expectations. For example, "did the model expose private data?" is hard to analyze deterministically.

Code checks use Python. Okareo will generate the Python for you with the Typescript okareo.generate_check SDK function. You can then pass the generated code to okareo.create_or_update_check.

// For code checks (e.g. deterministic)
okareo.create_or_update_check({
  name: str,
  description: str,
  check_config: {
    type: CheckOutputType.Score | CheckOutputType.PASS_FAIL,
    code_contents: <CHECK_PYTHON_CODE> // Python code that inherits from BaseCheck
  }
});

// For behavioral checks (e.g. prompt/judges)
okareo.create_or_update_check({
  name: str,
  description: str,
  check_config: {
    type: CheckOutputType.Score | CheckOutputType.PASS_FAIL,
    prompt_template: <CHECK_PROMPT> // The prompt describing the desired behavior
  }
});


delete_check

Deletes the check with the provided ID and name.

okareo.delete_check("<CHECK-UUID>", "<CHECK-NAME>")
/*
Check deletion was successful
*/

create_scenario_set

A scenario set is the Okareo unit of data collection. Any scenario can be used to drive a registered model or as a seed for synthetic data generation. Often both.

import { Okareo, SeedData, ScenarioType } from "okareo-ts-sdk";
const okareo = new Okareo({ api_key: OKAREO_API_KEY });

okareo.create_scenario_set({
  name: "NAME OF SCENARIO",
  project_id: PROJECT_ID,
  number_examples: 1,
  generation_type: ScenarioType.SEED,
  seed_data: [
    SeedData({
      input: "Example input to be sent to the model",
      result: "Expected result from the model"
    }),
  ]
});

find_datapoints

Datapoints are accessible for research and analysis as part of CI or elsewhere. Datapoints can be returned from a broad range of dimension criteria. Typically some combination of time, feedback, and model is used, but there are many others available.

import { Okareo, DatapointSearch } from "okareo-ts-sdk";
const okareo = new Okareo({api_key:OKAREO_API_KEY});

const data: any = await okareo.find_datapoints(
DatapointSearch({
project_id: project_id,
mut_id: model_id,
})
);

generate_check

Generates the contents of a .py file for implementing a CodeBasedCheck based on an EvaluatorSpecRequest. Pass the generated_code of this method's result to the create_or_update_check function to make the check available within Okareo.

const check = okareo.generate_check({
project_id: "",
description: "Return True if the model_output is at least 20 characters long, otherwise return False.",
requires_scenario_input: false, // True if check uses scenario input
requires_scenario_result: false, // True if check uses scenario result
output_data_type: "bool" | "int" | "float", // if pass/fail: 'bool'. if score: 'int' | 'float'
})

generate_scenario_set

Generate synthetic data based on a prior scenario. The seed scenario could be from a prior evaluation run, an upload, or statically defined.

import { Okareo, ScenarioType } from "okareo-ts-sdk";
const okareo = new Okareo({api_key:OKAREO_API_KEY});

const data: any = await okareo.generate_scenario_set(
{
project_id: project_id,
name: "EXAMPLE SCENARIO NAME",
source_scenario_id: "SOURCE_SCENARIO_ID",
number_examples: 2,
generation_type: ScenarioType.REPHRASE_INVARIANT,
}
)

get_all_checks

Return the list of all available checks. The returned list will include both predefined checks in Okareo as well as custom checks uploaded in association with your current organization.

okareo.get_all_checks()

get_check

Returns a detailed check response object. Useful if you have a check's ID and want to get more information about the check.

okareo.get_check("<UUID-FOR-CHECK>")

run_test

Run a test directly from a registered model. This requires both a registered model and at least one scenario.

The run_test function is called on a registered model in the form model_under_test.run_test(...). If your model requires an API key to call, then you will need to pass your key in the model_api_key parameter. Your API keys are not stored by Okareo.

warning

Depending on size and complexity, model runs can take a long time to evaluate. Use scenarios appropriate in size to the task at hand.

Read the Classification Overview to learn more about classification evaluations in Okareo.

// Classification evaluations return accuracy, precision, recall, and f1 scores.
const model_under_test = await okareo.register_model(...);
const test_run_response: any = await model_under_test.run_test({
name:"<YOUR_TEST_RUN_NAME>",
tags: [<OPTIONAL_ARRAY_OF_STRING_TAGS>],
project_id: project_id,
scenario_id:"<YOUR_SCENARIO_ID>",
model_api_key: "<YOUR_MODEL_API_KEY>", //Key for OpenAI, Cohere, Pinecone, QDrant, etc.,
calculate_metrics: true,
type: TestRunType.MULTI_CLASS_CLASSIFICATION,
} as RunTestProps);
/*
test_run_response: {
id:str,
project_id:str,
mut_id:str,
scenario_set_id:str,
name:str,
tags:Array[str],
type:'MULTI_CLASS_CLASSIFICATION',
start_time:Date,
end_time:Date,
test_data_point_count:int,
model_metrics: {
'weighted_average': {
'precision': float,
'recall': float,
'f1': float,
'accuracy': float
},
'scores_by_label': {
'label_1': {
'precision': float,
'recall': float,
'f1': float
},
...,
'label_N': {
'precision': float,
'recall': float,
'f1': float
},
}
},
error_matrix: [
{'label_1': [int, ..., int]},
...,
{'label_N': [int, ..., int]}
],
app_link: str
}
*/
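To make the relationship between error_matrix and the reported scores concrete, here is a rough sketch that derives per-label precision, recall, and f1, plus overall accuracy, from a matrix of that shape. It assumes rows are expected labels and columns are predicted labels in the same label order; that orientation is an assumption, not something confirmed by the response above, and the function names are hypothetical.

```typescript
// ASSUMPTION: error_matrix has one single-key object per label, rows are the
// expected labels, and columns are predicted labels in the same label order.
type ErrorMatrix = Array<Record<string, number[]>>;

interface LabelScores {
  precision: number;
  recall: number;
  f1: number;
}

// Unwrap [{label_1: [...]}, ...] into parallel label and row arrays.
function unpack(matrix: ErrorMatrix): { labels: string[]; rows: number[][] } {
  const labels: string[] = [];
  const rows: number[][] = [];
  for (const entry of matrix) {
    for (const label of Object.keys(entry)) {
      labels.push(label);
      rows.push(entry[label]);
    }
  }
  return { labels, rows };
}

function scoresByLabel(matrix: ErrorMatrix): Record<string, LabelScores> {
  const { labels, rows } = unpack(matrix);
  const scores: Record<string, LabelScores> = {};
  labels.forEach((label, i) => {
    const truePositives = rows[i][i];
    const expectedTotal = rows[i].reduce((sum, n) => sum + n, 0);
    const predictedTotal = rows.reduce((sum, row) => sum + row[i], 0);
    const precision = predictedTotal === 0 ? 0 : truePositives / predictedTotal;
    const recall = expectedTotal === 0 ? 0 : truePositives / expectedTotal;
    const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
    scores[label] = { precision, recall, f1 };
  });
  return scores;
}

function accuracy(matrix: ErrorMatrix): number {
  const { rows } = unpack(matrix);
  const total = rows.reduce((sum, row) => sum + row.reduce((s, n) => s + n, 0), 0);
  const correct = rows.reduce((sum, row, i) => sum + row[i], 0);
  return total === 0 ? 0 : correct / total;
}

// Two labels, 20 datapoints, 17 of them on the diagonal.
const exampleMatrix: ErrorMatrix = [{ billing: [8, 2] }, { support: [1, 9] }];
```

With exampleMatrix, 17 of 20 datapoints sit on the diagonal, so accuracy(exampleMatrix) is 0.85.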

ScenarioType

// import { ScenarioType } from "okareo-ts-sdk";
export declare enum ScenarioType {
COMMON_CONTRACTIONS = "COMMON_CONTRACTIONS",
COMMON_MISSPELLINGS = "COMMON_MISSPELLINGS",
CONDITIONAL = "CONDITIONAL",
LABEL_REVERSE_INVARIANT = "LABEL_REVERSE_INVARIANT",
NAMED_ENTITY_SUBSTITUTION = "NAMED_ENTITY_SUBSTITUTION",
NEGATION = "NEGATION",
REPHRASE_INVARIANT = "REPHRASE_INVARIANT",
ROUNDTRIP_INVARIANT = "ROUNDTRIP_INVARIANT",
SEED = "SEED",
TERM_RELEVANCE_INVARIANT = "TERM_RELEVANCE_INVARIANT",
TEXT_REVERSE_LABELED = "TEXT_REVERSE_LABELED",
TEXT_REVERSE_QUESTION = "TEXT_REVERSE_QUESTION"
}

Okareo has multiple synthetic data generators. We have provided details about each generator type below:

Common Contractions

ScenarioType.COMMON_CONTRACTIONS

Each input in the scenario will be shortened by 1 or 2 characters. For example, if the input is What is a steering wheel?, the generated input could be What is a steering whl?.

Common Misspellings

ScenarioType.COMMON_MISSPELLINGS

Common misspellings of the inputs will be generated. For example, if the input is What is a receipt?, the generated input could be What is a reviept?.

Conditional

ScenarioType.CONDITIONAL

Each input in the scenario will be rephrased as a conditional statement. For example, if the input is What are the side effects of this medicine?, the generated input could be Considering this medicine, what might be the potential side effects?.

Rephrase

ScenarioType.REPHRASE_INVARIANT

Rephrasings of the inputs will be generated. For example, if the input is Neil Alden Armstrong was an American astronaut and aeronautical engineer who in 1969 became the first person to walk on the Moon, the generated input could be Neil Alden Armstrong, an American astronaut and aeronautical engineer, made history in 1969 as the first individual to set foot on the Moon.

Reverse Question

ScenarioType.TEXT_REVERSE_QUESTION

Each input in the scenario will be rephrased as a question that the input should be the answer for. For example, if the input is The first game of baseball was played in 1846., the generated input could be When was the first game of baseball ever played?.

Seed

ScenarioType.SEED

The simplest of all generators. It does nothing. A true NoOp.

Term Relevance

ScenarioType.TERM_RELEVANCE_INVARIANT

Each input in the scenario will be rephrased to only include the most relevant terms, where relevance is based on the list of inputs provided to the scenario. We will then use parts of speech to determine a valid ordering of relevant terms. For example, if the inputs are all names of various milk teas such as Cool Sweet Honey Taro Milk Tea with Brown Sugar Boba, the generated input could be Taro Milk Tea, since Taro, Milk, and Tea could be the most relevant terms.


get_scenario_sets

Return one or more scenarios based on the project_id or a specific project_id + scenario_id pair.

import { Okareo, components } from "okareo-ts-sdk";

const okareo = new Okareo({api_key:OKAREO_API_KEY});
const project_id = "YOUR_PROJECT_ID";
const scenario_id = "YOUR_SCENARIO_ID";

const all_scenarios = await okareo.get_scenario_sets({ project_id });
// or
const specific_scenario = await okareo.get_scenario_sets({ project_id, scenario_id });

get_scenario_data_points

Return each of the datapoints that belong to a single scenario set.

import { Okareo, components } from "okareo-ts-sdk";
async get_scenario_data_points(scenario_id: string): Promise<components["schemas"]["ScenarioDataPoinResponse"][]> {
//...
}

get_test_run

Return a previously run test. This is useful for "hill-climbing" where you look at a prior run, make changes and re-run or if you want to baseline the current run from the last.

import { Okareo, components } from "okareo-ts-sdk";
async get_test_run(test_run_id: string): Promise<components["schemas"]["TestRunItem"]> {
//...
}

register_model

Register the model that you want to evaluate, test or collect datapoints from. Models must be uniquely named within a project namespace.

In order to run a test, you will need to register a model. If you have already registered a model with the same name, the existing model will be returned. The model data is only updated if the "update: true" flag is passed.

warning

The first time a model is defined, the attributes of the model are persisted. Subsequent calls to register_model will return the persisted model. They will not update the definition.

import { Okareo, CustomModel } from "okareo-ts-sdk";
const okareo = new Okareo({api_key:OKAREO_API_KEY});
const model_under_test = await okareo.register_model({
name: "Example Custom Model",
tags: ["Custom", "End-2-End"],
project_id: project_id,
models: {
type: "custom",
invoke: (input: string) => {
return {
actual: "Technical Support",
model_response: {
input: input,
method: "hard coded",
context: "Example context response",
}
}
}
} as CustomModel
});

Okareo has ready-to-run integrations with the following models and vector databases. Don't hesitate to reach out if you need another model.

OpenAI (LLM)

import { OpenAIModel } from 'okareo-ts-sdk';

interface OpenAIModel extends BaseModel {
type: "openai";
model_id: string;
temperature: number;
system_prompt_template: string;
user_prompt_template: string;
dialog_template: string;
tools?: unknown[];
}

Generation Model (LLM)

import { GenerationModel } from 'okareo-ts-sdk';

interface GenerationModel extends BaseModel {
type: "generation";
model_id: string;
temperature: number;
system_prompt_template: string;
user_prompt_template: string;
dialog_template: string;
tools?: unknown[];
}

The GenerationModel is a universal LLM interface that supports most model providers. Users can plug in different model names, including OpenAI, Anthropic, and Cohere models.

Example using Cohere model with GenerationModel:

import { GenerationModel } from 'okareo-ts-sdk';

const cohereModel: GenerationModel = {
type: "generation",
model_id: "command-r",
temperature: 0.7,
system_prompt_template: "You are a helpful assistant.",
};

Example with tools:

import { GenerationModel } from 'okareo-ts-sdk';

const tools = [
{
type: "function",
function: {
name: "get_current_weather",
description: "Get the current weather in a given location",
parameters: {
type: "object",
properties: {
location: {
type: "string",
description: "The city and state, e.g. San Francisco, CA"
},
unit: {
type: "string",
enum: ["celsius", "fahrenheit"]
}
},
required: ["location"]
}
}
}
];

const modelWithTools: GenerationModel = {
type: "generation",
model_id: "gpt-3.5-turbo-0613",
temperature: 0.7,
system_prompt_template: "You are a helpful assistant with access to weather information.",
tools: tools
};

In these examples, we're using the Cohere "command-r" model and the OpenAI "gpt-3.5-turbo-0613" model through the GenerationModel interface. The second example demonstrates how to include tools, which can be used for function calling capabilities.

Pinecone (VectorDB)

//import { TPineconeDB } from "okareo-ts-sdk";
export interface TPineconeDB extends BaseModel {
type?: string | undefined; // from BaseModel
tags?: string[] | undefined; // from BaseModel
index_name: string;
region: string;
project_id: string;
top_k: string;
}

QDrant (VectorDB)

//import { TQDrant } from "okareo-ts-sdk";
export interface TQDrant extends BaseModel {
type?: string | undefined; // from BaseModel
tags?: string[] | undefined; // from BaseModel
collection_name: string;
url: string;
top_k: string;
}

Custom Model

You can use the CustomModel object to define your own custom, provider-agnostic models.

//import { TCustomModel } from "okareo-ts-sdk";
export interface TCustomModel extends BaseModel {
invoke(input: string): {
actual: any | string;
model_response: {
input: any | string;
method: any | string;
context: any | string;
}
};
}

To use the CustomModel object, you will need to implement an invoke method that returns a ModelInvocation object. For example,

import { CustomModel, ModelInvocation } from "okareo-ts-sdk";

const my_custom_model: CustomModel = {
type: "custom",
invoke: (input: string) => {
// your model's invoke logic goes here
return {
model_prediction: ...,
model_input: input,
model_output_metadata: {
prediction: ...,
other_data_1: ...,
other_data_2: ...,
...,
},
tool_calls: ...
} as ModelInvocation
}
}

Where the ModelInvocation's fields are defined as follows:

export interface ModelInvocation {
/**
* Prediction from the model to be used when running the evaluation,
* e.g. predicted class from classification model or generated text completion from
* a generative model. This would typically be parsed out of the overall model_output_metadata
*/
model_prediction?: Record<string, any> | unknown[] | string;
/**
* All the input sent to the model
*/
model_input?: Record<string, any> | unknown[] | string;
/**
* Full model response, including any metadata returned with model's output
*/
model_output_metadata?: Record<string, any> | unknown[] | string;
/**
* List of tool calls made during the model invocation, if any
*/
tool_calls?: any[];
}

The logic of your invoke method depends on many factors, chief among them the intended TestRunType of the CustomModel. Below, we highlight an example of how to use CustomModel for each TestRunType in Okareo.

The following CustomModel classification example is taken from the custommodel.test.ts script. This model always returns "Technical Support" as the model_prediction.

const classificationModel = CustomModel({
type: "custom",
invoke: (input: string) => {
return {
model_prediction: "Technical Support",
model_input: input,
model_output_metadata: {
input: input,
method: "hard coded",
context: "Example context"
}
} as ModelInvocation
}
});

MultiTurnDriver

A MultiTurnDriver allows you to evaluate a language model over the course of a full conversation. The MultiTurnDriver is made up of two pieces: a Driver and a Target.

The Driver is defined in your MultiTurnDriver, while your Target is defined as either a CustomMultiturnTarget or a GenerationModel.

// import { MultiTurnDriver, StopConfig } from "okareo-ts-sdk"
export interface MultiTurnDriver extends BaseModel {
  type: "driver";
  target: GenerationModel | CustomMultiturnTarget;
  driver_temperature?: number; // default: 0.8
  max_turns?: bigint;          // default: 5
  repeats?: bigint;            // default: 1
  first_turn?: string;         // default: "target"
  stop_check: StopConfig;
}

Driver

The possible parameters for the Driver are:

driver_temperature: number = 0.8
max_turns: bigint = 5
repeats: bigint = 1
first_turn: string = "target"
stop_check: StopConfig

driver_temperature defines temperature used in the model that will simulate a user.

max_turns defines the maximum number of back-and-forth interactions that can be in the conversation.

repeats defines how many times each row in a scenario will be run when a model is run with run_test. Since the Driver is non-deterministic, repeating the same row of a scenario can lead to different conversations.

first_turn defines whether the Target or the Driver will send the first message in the conversation.

stop_check defines how the conversation will stop. It requires the check name and a boolean value defining whether the conversation stops on a True or a False value returned from the check.
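The way these parameters interact can be sketched with a small local simulation. This is only an illustration under stated assumptions, not Okareo's implementation: it treats one turn as one message from each side, evaluates the stop check after every turn, and the names `simulate`, `LoopConfig`, and `StopConfigLike` are hypothetical.

```typescript
type Speaker = "driver" | "target";

interface Message {
  speaker: Speaker;
  content: string;
}

// Local stand-in mirroring the StopConfig fields shown on this page.
interface StopConfigLike {
  check_name: string;
  stop_on: boolean;
}

interface LoopConfig {
  max_turns: number;
  first_turn: Speaker;
  stop_check: StopConfigLike;
}

// ASSUMPTION: one "turn" is one message from each side, and the stop check
// is evaluated after every completed turn.
function simulate(
  config: LoopConfig,
  speak: (speaker: Speaker, history: Message[]) => string,
  runCheck: (checkName: string, history: Message[]) => boolean,
): Message[] {
  const history: Message[] = [];
  const second: Speaker = config.first_turn === "target" ? "driver" : "target";
  for (let turn = 0; turn < config.max_turns; turn++) {
    for (const speaker of [config.first_turn, second]) {
      history.push({ speaker, content: speak(speaker, history) });
    }
    // Stop as soon as the check result equals stop_on.
    if (runCheck(config.stop_check.check_name, history) === config.stop_check.stop_on) {
      break;
    }
  }
  return history;
}
```

With stop_on set to true, the loop ends on the first turn where the check passes; with stop_on set to false, it ends on the first turn where the check fails. In both cases max_turns caps the conversation length.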

Target

A Target is either a GenerationModel or a CustomMultiturnTarget. Refer to GenerationModel for details on GenerationModel.

The only exception to the standard usage is that a system_prompt_template is required when using a MultiTurnDriver. The system_prompt_template defines the system prompt for how the Target should behave.

A CustomMultiturnTarget is defined in largely the same way as a CustomModel. The key difference is that the input is a list of messages in OpenAI's message format.
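As a rough sketch of what target logic over a message list might look like, the function below reads the conversation so far and returns an object shaped like the ModelInvocation described on this page. The local types and the `invokeTarget` name are hypothetical stand-ins so the example compiles on its own; the exact CustomMultiturnTarget signature is not shown here and should be checked against the SDK.

```typescript
// Local stand-ins; in a real project ModelInvocation would come from okareo-ts-sdk.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ModelInvocationLike {
  model_prediction?: string;
  model_input?: ChatMessage[];
  model_output_metadata?: Record<string, unknown>;
}

// Hypothetical target logic: answer the most recent user message.
function invokeTarget(messages: ChatMessage[]): ModelInvocationLike {
  const lastUser = [...messages].reverse().find((m) => m.role === "user");
  const reply = lastUser
    ? `You said: ${lastUser.content}`
    : "Hello! How can I help you today?";
  return {
    model_prediction: reply,              // text used as the target's next message
    model_input: messages,                // everything the target received
    model_output_metadata: { turns_seen: messages.length },
  };
}
```

A real target would replace the echo logic with a call to your own model or service, passing the full message history so the model keeps conversational context.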

Driver and Target Interaction

The Driver simulates user behavior, while the Target represents the AI model being tested. This setup allows for testing complex scenarios and evaluating the model's performance over extended conversations.

Setting up a scenario

Scenarios in MultiTurnDriver are crafted using SeedData. The input field serves as a driver prompt that instructs the simulated user (Driver) on how to behave throughout the conversation: which questions to ask, which responses to give, and how to react to the model's function calls. This creates a controlled yet dynamic testing environment for evaluating the model's performance across realistic interaction patterns.

const seedData: SeedData[] = [
{
input: "You are interacting with a customer service agent. First, ask about WebBizz...",
result: "N/A",
},
// ... more seed data
];

Tools and Function Calling

The Target model can be equipped with tools, which are essentially functions the model can call. For instance:

const tools = [
{
type: "function",
function: {
name: "delete_account",
description: "Deletes the user's account",
// ... parameter details
},
}
];

These tools allow the model to perform specific actions, like deleting a user account in this case.

Mocking Tool Results

The driver prompt can be used to mock the results of tool calls. This is crucial for testing how the model responds to different outcomes without actually performing the actions. For example:

const input = `... If you receive any function calls, output the result in JSON format 
and provide a JSON response indicating that the deletion was successful.`;

This prompt instructs the Driver to simulate a successful account deletion when the function is called.

Checks and Conversation Control

Checks are used to evaluate specific aspects of the conversation or to control its flow. For instance:

const stopCheck: StopConfig = {
check_name: "task_completion_delete_account",
stop_on: true,
};

This configuration stops the conversation when the account deletion task is completed.

Custom checks can be created to evaluate various aspects of the conversation:

okareo.create_or_update_check({
  name: "task_completion_delete_account",
  description: "Check if the agent confirms account deletion",
  check_config: {
    type: CheckOutputType.PASS_FAIL,
    prompt_template: "<CHECK_PROMPT>" // The prompt describing the desired behavior
  }
});

These checks can assess task completion, adherence to guidelines, or any other relevant criteria.

upload_scenario_set

Batch upload jsonl formatted data to create a scenario. This is the most efficient method for pushing large data sets for tests and evaluations.

import { Okareo } from "okareo-ts-sdk";
const okareo = new Okareo({api_key:OKAREO_API_KEY});
const data: any = await okareo.upload_scenario_set(
{
file_path: "example_data/seed_data.jsonl",
scenario_name: "Uploaded Scenario Set",
project_id: project_id
}
);
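If your seed data lives in memory, a small helper can serialize it to JSONL before upload. This is a sketch under the assumption that each line is a JSON object with the same input and result keys used by SeedData; the helper names are hypothetical and not part of the SDK.

```typescript
// ASSUMPTION: each JSONL line carries the same keys as SeedData (input, result).
interface SeedRow {
  input: string;
  result: string;
}

// One JSON object per line, newline-terminated.
function toJsonl(rows: SeedRow[]): string {
  return rows
    .map((row) => JSON.stringify({ input: row.input, result: row.result }))
    .join("\n") + "\n";
}

// Inverse helper, useful for sanity-checking a file before uploading it.
function parseJsonl(text: string): SeedRow[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as SeedRow);
}

const jsonl = toJsonl([
  { input: "I can't log in to my account", result: "Technical Support" },
  { input: "Where is my refund?", result: "Billing" },
]);
```

The resulting string can be written to a file such as example_data/seed_data.jsonl and passed as file_path.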

Reporters

The Okareo Typescript SDK includes a set of reporters. The reporters allow you to get rapid feedback in CI or locally from the command line.

Reporters are convenience functions that interpret evaluations based on thresholds that you provide. The reporters are not persisted and do not alter or change the evaluation. They are simply conveniences for rapid summarization locally and in CI.

Singleton Evaluation Reporters

There are two categories of reporters. The singleton reporters are based on specific evaluation types and can report on each. You can set thresholds specific to classification, retrieval, or generation and the reporters will provide detailed pass/fail information. The second category provides trend information. The history reporter takes a list of evaluations along with a threshold instance and returns a table of results over time.

Class ClassificationReporter

The classification reporter takes the evaluated metrics and the confusion matrix and returns a pass/fail, count of errors, and the specific metric that fails.

info

By convention we define the reporter thresholds independently. This way we can re-use them in trend analysis and across evaluations.

Example console output from a passing test: [image: Okareo Diagram]

import { ClassificationReporter } from "okareo-ts-sdk";
/*
... body of evaluation
*/
const eval_thresholds = {
error_max: 8,
metrics_min: {
precision: 0.95,
recall: 0.9,
f1: 0.9,
accuracy: 0.95
}
}
const reporter = new ClassificationReporter({
eval_run:classification_run,
...eval_thresholds,
});
reporter.log(); //provides a table of results
/*
// do something if it fails
if (!reporter.pass) { ... }
*/
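The threshold logic can be pictured with a minimal local sketch. It is not the reporter's implementation: it assumes a metric fails when it falls below its minimum and that the run fails when the error count exceeds error_max. The names `evaluateThresholds` and `Verdict` are hypothetical.

```typescript
interface ClassificationMetrics {
  precision: number;
  recall: number;
  f1: number;
  accuracy: number;
}

type MetricName = keyof ClassificationMetrics;

interface Thresholds {
  error_max: number;
  metrics_min: Partial<ClassificationMetrics>;
}

interface Verdict {
  pass: boolean;
  errors: number;
  failed_metrics: MetricName[];
}

// ASSUMPTION: a metric fails when it is strictly below its minimum, and the
// run fails when errorCount is strictly greater than error_max.
function evaluateThresholds(
  metrics: ClassificationMetrics,
  errorCount: number,
  thresholds: Thresholds,
): Verdict {
  const names = Object.keys(thresholds.metrics_min) as MetricName[];
  const failed = names.filter((name) => {
    const minimum = thresholds.metrics_min[name];
    return minimum !== undefined && metrics[name] < minimum;
  });
  return {
    pass: failed.length === 0 && errorCount <= thresholds.error_max,
    errors: errorCount,
    failed_metrics: failed,
  };
}
```

Keeping the thresholds in their own object, as in the example above, means the same values can be reused for trend analysis.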

Class RetrievalReporter

The retrieval reporter provides a shortcut for metrics @k. Each metric can reference a different k value. The result of the report is always in summary form and only returns metrics that exceed thresholds.

Example console output from a failing test: [image: Okareo Diagram]

import { RetrievalReporter } from "okareo-ts-sdk";
/*
... body of evaluation
*/
const report = new RetrievalReporter(
{
eval_run:data, // data from a retrieval run
metrics_min: {
'Accuracy@k': {
value: 0.96,
at_k: 3
},
'Precision@k': {
value: 0.5,
at_k: 1 // can use different k values by metric
},
'Recall@k': {
value: 0.8,
at_k: 2 // can use different k values by metric
},
'NDCG@k': {
value: 0.2,
at_k: 3
},
'MRR@k': {
value: 0.96,
at_k: 3
},
'MAP@k': {
value: 0.96,
at_k: 3
}
}
}
);
expect(report.pass).toBeTruthy(); // example report assertion
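For readers less familiar with metrics @k, the sketch below computes three of them for a single query using their standard textbook definitions. Okareo's exact calculations may differ, and the function names are hypothetical.

```typescript
// Count how many of the top-k ranked ids are relevant.
function hitsInTopK(ranked: string[], relevant: Set<string>, k: number): number {
  return ranked.slice(0, Math.max(k, 0)).filter((id) => relevant.has(id)).length;
}

// Precision@k: fraction of the top-k results that are relevant.
function precisionAtK(ranked: string[], relevant: Set<string>, k: number): number {
  return k <= 0 ? 0 : hitsInTopK(ranked, relevant, k) / k;
}

// Recall@k: fraction of all relevant items that appear in the top-k.
function recallAtK(ranked: string[], relevant: Set<string>, k: number): number {
  return relevant.size === 0 ? 0 : hitsInTopK(ranked, relevant, k) / relevant.size;
}

// Reciprocal rank@k: 1 / rank of the first relevant result within the top-k,
// or 0 if none is found. MRR@k is the mean of this value across queries.
function reciprocalRankAtK(ranked: string[], relevant: Set<string>, k: number): number {
  const index = ranked.slice(0, Math.max(k, 0)).findIndex((id) => relevant.has(id));
  return index === -1 ? 0 : 1 / (index + 1);
}

const rankedIds = ["d3", "d1", "d7", "d2"];
const relevantIds = new Set(["d1", "d2"]);
```

Here the first relevant document sits at rank 2, so precisionAtK(rankedIds, relevantIds, 1) is 0 while recallAtK(rankedIds, relevantIds, 2) and reciprocalRankAtK(rankedIds, relevantIds, 3) are both 0.5. This is why it can make sense to set a different k per metric.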

Class GenerationReporter

The generation reporter takes an arbitrary list of metric name:value pairs and reports on results that did not meet the minimum threshold defined. Often these metrics are unique to your circumstance. Boolean values will be treated as "0" or "1".

Example console output from a failing test: [image: Okareo Diagram]

import { GenerationReporter } from "okareo-ts-sdk";
/*
... body of evaluation
*/
const report = new GenerationReporter(
{
eval_run:data,
metrics_min: {
coherence: 4.9,
consistency: 3.2,
fluency: 4.7,
relevance: 4.3,
overall: 4.1
}
}
);
expect(report.pass).toBeTruthy(); // example report assertion
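The boolean handling described above can be illustrated with a short sketch: booleans are coerced to 0 or 1 before being compared with the minimum. The `belowMinimum` helper is hypothetical, and treating a missing metric as a failure is an assumption of this sketch rather than documented reporter behavior.

```typescript
type MetricValue = number | boolean;

// Return the names of metrics that fall below their configured minimum.
function belowMinimum(
  results: Record<string, MetricValue | undefined>,
  metricsMin: Record<string, number>,
): string[] {
  return Object.keys(metricsMin).filter((name) => {
    const raw = results[name];
    if (raw === undefined) {
      return true; // ASSUMPTION: a missing metric counts as a failure
    }
    // Booleans are compared as 1 (true) or 0 (false).
    const value = typeof raw === "boolean" ? (raw ? 1 : 0) : raw;
    return value < metricsMin[name];
  });
}

const failing = belowMinimum(
  { coherence: 4.95, fluency: 4.5, is_json: true },
  { coherence: 4.9, fluency: 4.7, is_json: 1 },
);
```

In this example only fluency falls short, so failing contains a single entry. The is_json metric name is made up for illustration.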

History Reporter

The second category of reporter provides historical information based on a series of test runs. Like the singletons, each reporter analyzes a single evaluation type at a time. However the mechanism is shared across all types.

Class EvaluationHistoryReporter

The EvaluationHistoryReporter requires four inputs: the evaluation type, list of evals, assertions, and the number to render. The type must be one of the Okareo TestRunType definitions. The assertions are shared with the singleton reports.

info

By convention we define the reporter thresholds independently. Re-using thresholds between singleton reports and historic reports is one of the many reasons.

Classification Report: [image: Okareo Diagram]

Retrieval Report: [image: Okareo Diagram]

Generation Report: [image: Okareo Diagram]

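import { EvaluationHistoryReporter, TestRunType, components } from "okareo-ts-sdk";

// class_metrics holds the same thresholds used by the singleton reporter
const history_class = new EvaluationHistoryReporter({
  type: TestRunType.MULTI_CLASS_CLASSIFICATION,
  evals: [
    TEST_RUN_CLASSIFICATION as components["schemas"]["TestRunItem"],
    TEST_RUN_CLASSIFICATION as components["schemas"]["TestRunItem"],
  ],
  assertions: class_metrics,
  last_n: 5,
});
history_class.log();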
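Conceptually, a history report keeps the most recent runs and applies the shared thresholds to each. The sketch below shows that idea on plain objects. It is an assumption-laden illustration, not the reporter's implementation: the `RunSummary` shape and the `historyRows` name are hypothetical, and it assumes runs are ordered by start time.

```typescript
interface RunSummary {
  name: string;
  start_time: string; // ISO 8601 timestamp
  metrics: Record<string, number>;
}

interface HistoryRow {
  name: string;
  pass: boolean;
}

// Keep the last_n most recent runs (oldest first) and flag each one against
// the same minimum thresholds used for a single evaluation.
function historyRows(
  runs: RunSummary[],
  metricsMin: Record<string, number>,
  lastN: number,
): HistoryRow[] {
  if (lastN <= 0) {
    return [];
  }
  const ordered = [...runs].sort(
    (a, b) => Date.parse(a.start_time) - Date.parse(b.start_time),
  );
  return ordered.slice(-lastN).map((run) => ({
    name: run.name,
    pass: Object.keys(metricsMin).every((metric) => {
      const value = run.metrics[metric];
      return value !== undefined && value >= metricsMin[metric];
    }),
  }));
}

const exampleRuns: RunSummary[] = [
  { name: "run-a", start_time: "2024-01-01T00:00:00Z", metrics: { f1: 0.8 } },
  { name: "run-b", start_time: "2024-01-03T00:00:00Z", metrics: { f1: 0.95 } },
  { name: "run-c", start_time: "2024-01-02T00:00:00Z", metrics: { f1: 0.92 } },
];
```

Calling historyRows(exampleRuns, { f1: 0.9 }, 2) keeps run-c and run-b in chronological order, and both meet the threshold.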

Exporting Reports for CI

Class JSONReporter

When using Okareo as part of a CI run, it is useful to export evaluations into a common location that can be picked up by the CI analytics.

By using JSONReporter.log([eval_run, ...]) after each evaluation, Okareo will collect the json results in ./.okareo/reports. The location can be controlled as part of the CLI with the -r LOCATION or --report LOCATION parameters. The output JSON is useful in CI for historical reference.

info

JSONReporter.log([eval_run, ...]) will output to the console unless the evaluation is initiated by the CLI.

import { JSONReporter } from 'okareo-ts-sdk';
const reporter = new JSONReporter([eval_run]);
reporter.log();