TypeScript SDK
Okareo has a rich set of APIs that you can explore through the API Guide. This SDK provides access to all of the Okareo API endpoints through the OpenAPI spec. It also provides convenience functions that make testing and development with the Okareo platform faster.
In addition to making model baseline evaluation available in development, you can use this SDK to drive automation, for example in CI/CD. The TypeScript library is transpiled to JavaScript, so the SDK can be used in any CommonJS project.
The SDK requires an API Token. Refer to the Okareo API Key guide for more information.
Overview
There are multiple ways to automate Okareo through TypeScript.
- Okareo CLI - Using the Okareo CLI directly lets you write TypeScript/JavaScript while keeping Okareo independent from the rest of your project. Refer to the Okareo SDK/CLI pages to learn more.
- Unit Testing - We find that models are usually part of a larger application context. When this is the case, it is beneficial to run model evaluations and scenario expansion as part of your general CI/CD process.
The Okareo cookbooks in the okareo-cookbook GitHub repository provide examples you can build from, including using the CLI directly, driving Okareo from Jest, and more.
SDK Installation
- NPM
- Yarn
npm install -D okareo-ts-sdk
yarn add -D okareo-ts-sdk
Using the Okareo Typescript SDK
Jest: Hello Projects!
The following Jest example creates an Okareo instance, requests a list of projects, and then verifies that more than zero projects are returned.
import { Okareo } from 'okareo-ts-sdk';

const OKAREO_API_KEY = process.env.OKAREO_API_KEY;

describe('Example', () => {
    test('Get All Projects', async () => {
        const okareo = new Okareo({ api_key: OKAREO_API_KEY });
        const projects: any[] = await okareo.getProjects();
        expect(projects.length).toBeGreaterThan(0);
    });
});
AI/LLM Evaluation Workflow
The following script synthetically transforms a set of direct requests into passive questions and then evaluates the core_app.getIntentContextTemplate(user, chat_history) context through OpenAI to determine whether the original intent is maintained. The number of synthetic examples created is 3 times the number of rows in the DIRECTED_INTENT seed data passed in.
import { Okareo, OpenAIModel, RunTestProps, ScenarioType, TestRunType, ClassificationReporter } from 'okareo-ts-sdk';

const OKAREO_API_KEY = process.env.OKAREO_API_KEY;
// project_id, DIRECTED_INTENT, OPENAI_API_KEY, core_app, user, and chat_history
// are assumed to be defined elsewhere in your application.

const main = async () => {
    try {
        const okareo = new Okareo({ api_key: OKAREO_API_KEY });
        const sData: any = await okareo.create_scenario_set({
            name: "Detect Passive Intent",
            project_id: project_id,
            number_examples: 3,
            generation_type: ScenarioType.TEXT_REVERSE_QUESTION,
            seed_data: DIRECTED_INTENT
        });
        const model_under_test = await okareo.register_model({
            name: "User Chat Intent - 3.5 Turbo",
            tags: ["TS-SDK", "Testing"],
            project_id: project_id,
            models: {
                type: "openai",
                model_id: "gpt-3.5-turbo",
                temperature: 0.5,
                system_prompt_template: core_app.getIntentContextTemplate(user, chat_history),
                user_prompt_template: `{scenario_input}`
            } as OpenAIModel
        });
        const eval_run: any = await model_under_test.run_test({
            name: "TS-SDK Classification",
            tags: ["Classification", "BUILD_ID"],
            model_api_key: OPENAI_API_KEY,
            project_id: project_id,
            scenario_id: sData.scenario_id,
            calculate_metrics: true,
            type: TestRunType.MULTI_CLASS_CLASSIFICATION,
        } as RunTestProps);
        const reporter = new ClassificationReporter({
            eval_run,
            error_max: 2, // allows for up to 2 errors
            metrics_min: {
                precision: 0.95,
                recall: 0.9,
                f1: 0.9,
                accuracy: 0.95
            },
        });
        reporter.log(); // logs a table to the console output with the report results
    } catch (error) {
        console.error(error);
    }
}
main();
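If this script runs in CI, the reporter's pass flag can gate the build. A minimal sketch (using process.exit to fail the job is a convention of this example, not an SDK requirement):
// inside main(), after reporter.log():
if (!reporter.pass) {
    process.exit(1); // fail the CI job when any threshold is missed
}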
TypeScript SDK and Okareo API
The Okareo TypeScript SDK is a set of convenience functions and wrappers for the Okareo REST API.
Reporters are currently only supported in TypeScript. If you are interested in Python support, please let us know.
Class Okareo
create_or_update_check
This uploads or updates a check with the specified name. If a check with that name already exists and the name is not shared with a predefined Okareo check, the existing check will be overwritten. Returns a detailed check response object.
There are two types of checks - Code (Deterministic) and Behavioral (Model Judge). Code-based checks are very fast and entirely predictable: they are code. Behavioral checks pass judgment based on inference, so they are slower and can be less predictable. However, they are occasionally the best way to express behavioral expectations. For example, "did the model expose private data?" is hard to analyze deterministically.
Code checks use Python. Okareo will generate the Python for you with the TypeScript okareo.generate_check SDK function. You can then pass the code result to okareo.create_or_update_check.
- Usage
- Result
- my_custom_check.py
// For code checks (e.g. deterministic)
okareo.create_or_update_check({
    name: string,
    description: string,
    check_config: {
        type: CheckOutputType.Score | CheckOutputType.PASS_FAIL,
        code_contents: <CHECK_PYTHON_CODE> // Python code that inherits from BaseCheck
    }
});

// For behavioral checks (e.g. prompt/judges)
okareo.create_or_update_check({
    name: string,
    description: string,
    check_config: {
        type: CheckOutputType.Score | CheckOutputType.PASS_FAIL,
        prompt_template: <CHECK_PROMPT> // The prompt describing the desired behavior
    }
});
/** EvaluatorDetailedResponse */
EvaluatorDetailedResponse: {
/**
* Id
* Format: uuid
*/
id?: string;
/**
* Project Id
* Format: uuid
*/
project_id?: string;
/** Name */
name?: string;
/**
* Description
* @default
*/
description?: string;
/** Requires Scenario Input */
requires_scenario_input?: boolean;
/** Requires Scenario Result */
requires_scenario_result?: boolean;
/**
* Output Data Type
* @default
*/
output_data_type?: string;
/**
* Code Contents
* @default
*/
code_contents?: string;
/**
* Time Created
* Format: date-time
*/
time_created?: string;
/** Warning */
warning?: string;
/** Check Config */
check_config?: Record<string, never>;
};
from typing import Union

from okareo.checks import CodeBasedCheck
# any other imports required for your check

class Check(CodeBasedCheck):
    @staticmethod
    def evaluate(
        model_output: str, scenario_input: str, scenario_result: str, metadata: dict
    ) -> Union[bool, int, float]:
        # Your code here
        output = ...
        return output
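As a concrete behavioral example, the prompt_template carries the judging instruction. The following sketch registers a pass/fail judge for the private-data question above; the check name and prompt are illustrative:
// Sketch: a behavioral (model judge) check with an illustrative name and prompt.
okareo.create_or_update_check({
    name: "no_private_data", // illustrative name
    description: "Fails when the model output appears to expose private data.",
    check_config: {
        type: CheckOutputType.PASS_FAIL,
        prompt_template: "Did the model output expose any private or personally identifiable data? Answer True or False."
    }
});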
delete_check
Deletes the check with the provided ID and name.
okareo.delete_check("<CHECK-UUID>", "<CHECK-NAME>")
/*
Check deletion was successful
*/
create_scenario_set
A scenario set is the Okareo unit of data collection. Any scenario can be used to drive a registered model or as a seed for synthetic data generation. Often both.
- Usage
- Details
- Result
import { Okareo, SeedData, ScenarioType } from "okareo-ts-sdk";

const okareo = new Okareo({ api_key: OKAREO_API_KEY });

okareo.create_scenario_set({
    name: "NAME OF SCENARIO",
    project_id: PROJECT_ID,
    number_examples: 1,
    generation_type: ScenarioType.SEED,
    seed_data: [
        SeedData({
            input: "Example input to be sent to the model",
            result: "Expected result from the model"
        }),
    ]
});
Takes a single argument ScenarioSetCreate
async create_scenario_set(props: components["schemas"]["ScenarioSetCreate"]): Promise<components["schemas"]["ScenarioSetResponse"]> {
//...
}
import { components } from "okareo-ts-sdk";
//components["schemas"]["ScenarioSetCreate"]
ScenarioSetCreate: {
/**
* Project Id
* Format: uuid
* @description ID for the project
*/
project_id?: string;
/**
* Name
* @description Name of the scenario set
*/
name: string;
/**
* Seed Data
* @description Seed data is a list of dictionaries, each with an input and result
*/
seed_data: components["schemas"]["SeedData"][];
/**
* Number Examples
* @description Number of examples
*/
number_examples: number;
/**
* @description Type of generation. Current supported scenario types are:<br />
* Seed: Seed data for a scenario set<br />
* Rephrase invariant: Results will be rephrased versions of inputs<br />
* Conditional: Results will be rephrased inputs represented in a conditional format<br />
* Text reverse question: The result will be the target question for the input<br />
* Text reverse label: The result will be the intent of the target question for the input
* @default SEED
*/
generation_type?: components["schemas"]["ScenarioType"];
/**
* @description Tone to use for scenario generation.
* @default Neutral
*/
generation_tone?: components["schemas"]["GenerationTone"];
};
import { components } from "okareo-ts-sdk";
// components["schemas"]["ScenarioSetResponse"]
/** ScenarioSetResponse */
ScenarioSetResponse: {
/**
* Scenario Id
* Format: uuid
*/
scenario_id: string;
/**
* Project Id
* Format: uuid
*/
project_id: string;
/**
* Time Created
* Format: date-time
*/
time_created: string;
/** Type */
type: string;
/**
* Tags
* @default []
*/
tags?: string[];
/** Name */
name?: string;
/**
* Seed Data
* @default []
*/
seed_data?: components["schemas"]["SeedData"][];
/**
* Scenario Count
* @default 0
*/
scenario_count?: number;
/**
* Scenario Input
* @default []
*/
scenario_input?: string[];
/**
* App Link
* @description This URL links to the Okareo webpage for this scenario set
* @default
*/
app_link?: string;
};
find_datapoints
Datapoints are accessible for research and analysis as part of CI or elsewhere. Datapoints can be filtered by a broad range of criteria. Typically some combination of time, feedback, and model is used, but many other dimensions are available.
- Usage
- Details
- Result
import { Okareo, DatapointSearch } from "okareo-ts-sdk";
const okareo = new Okareo({api_key:OKAREO_API_KEY});
const data: any = await okareo.find_datapoints(
DatapointSearch({
project_id: project_id,
mut_id: model_id,
})
);
async find_datapoints(props: components["schemas"]["DatapointSearch"]): Promise<components["schemas"]["DatapointListItem"][]> {
//...
}
import { components } from "okareo-ts-sdk";
// components["schemas"]["DatapointSearch"]
DatapointSearch: {
/**
* Tags
* @description Tags are strings that can be used to filter datapoints in the Okareo app
* @default []
*/
tags?: string[];
/**
* From Date
* Format: date-time
* @description Earliest date
* @default 2022-12-31T23:59:59.999999
*/
from_date?: string;
/**
* To Date
* Format: date-time
* @description Latest date
*/
to_date?: string;
/**
* Feedback
* @description Feedback is a 0 to 1 float value that captures user feedback range for related datapoint results
*/
feedback?: number;
/** Error Code */
error_code?: string;
/**
* Context Token
* @description Context token is a unique token to link various datapoints which originate from the same context
*/
context_token?: string;
/**
* Project Id
* @description Project ID
*/
project_id?: string;
/**
* Mut Id
* Format: uuid
* @description Model ID
*/
mut_id?: string;
/**
* Test Run Id
* Format: uuid
* @description Test run ID
*/
test_run_id?: string;
};
Returns an array of DatapointListItem objects
import { components } from "okareo-ts-sdk";
// components["schemas"]["DatapointListItem"]
DatapointListItem: {
/**
* Id
* Format: uuid
*/
id: string;
/**
* Tags
* @default []
*/
tags?: string[];
/** Input */
input?: Record<string, never> | unknown[] | string;
/**
* Input Datetime
* Format: date-time
*/
input_datetime?: string;
/** Result */
result?: Record<string, never> | unknown[] | string;
/**
* Result Datetime
* Format: date-time
*/
result_datetime?: string;
/** Feedback */
feedback?: number;
/** Error Message */
error_message?: string;
/** Error Code */
error_code?: string;
/**
* Time Created
* Format: date-time
*/
time_created?: string;
/** Context Token */
context_token?: string;
/**
* Mut Id
* Format: uuid
*/
mut_id?: string;
/**
* Project Id
* Format: uuid
*/
project_id?: string;
/**
* Test Run Id
* Format: uuid
*/
test_run_id?: string;
};
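Beyond model ID, any combination of DatapointSearch fields can narrow the query. A sketch that bounds the search to a date window (the dates and variable names are illustrative):
import { Okareo, DatapointSearch } from "okareo-ts-sdk";

const okareo = new Okareo({ api_key: OKAREO_API_KEY });
// Illustrative filter: one model's datapoints within a one-month window.
const recent: any = await okareo.find_datapoints(
    DatapointSearch({
        project_id: project_id,
        mut_id: model_id,
        from_date: "2024-01-01T00:00:00",
        to_date: "2024-02-01T00:00:00",
    })
);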
generate_check
Generates the contents of a .py file for implementing a CodeBasedCheck based on an EvaluatorSpecRequest. Pass the generated_code of this method's result to the create_or_update_check function to make the check available within Okareo.
- Usage
- Result
const check = await okareo.generate_check({
    project_id: "",
    description: "Return True if the model_output is at least 20 characters long, otherwise return False.",
    requires_scenario_input: false, // true if the check uses the scenario input
    requires_scenario_result: false, // true if the check uses the scenario result
    output_data_type: "bool", // pass/fail checks use 'bool'; score checks use 'int' or 'float'
});
/** EvaluatorGenerateResponse */
EvaluatorGenerateResponse: {
/** Name */
name?: string;
/** Description */
description?: string;
/** Requires Scenario Input */
requires_scenario_input?: boolean;
/** Requires Scenario Result */
requires_scenario_result?: boolean;
/** Output Data Type */
output_data_type?: string;
/** Generated Code */
generated_code?: string;
};
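Putting the two calls together, one possible flow, sketched here with an illustrative check name, is to generate the code and immediately upload it:
// Sketch: generate the Python for a code check, then upload it as a check.
const generated = await okareo.generate_check({
    project_id: project_id,
    description: "Return True if the model_output is at least 20 characters long, otherwise return False.",
    requires_scenario_input: false,
    requires_scenario_result: false,
    output_data_type: "bool",
});

await okareo.create_or_update_check({
    name: "output_length_check", // illustrative name
    description: generated.description,
    check_config: {
        type: CheckOutputType.PASS_FAIL,
        code_contents: generated.generated_code,
    },
});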
generate_scenario_set
Generate synthetic data based on a prior scenario. The seed scenario could be from a prior evaluation run, an upload, or statically defined.
- Usage
- Details
- Result
import { Okareo, ScenarioType } from "okareo-ts-sdk";

const okareo = new Okareo({ api_key: OKAREO_API_KEY });

const data: any = await okareo.generate_scenario_set({
    project_id: project_id,
    name: "EXAMPLE SCENARIO NAME",
    source_scenario_id: "SOURCE_SCENARIO_ID",
    number_examples: 2,
    generation_type: ScenarioType.REPHRASE_INVARIANT,
});
Takes a single argument, ScenarioSetGenerate.
async generate_scenario_set(props: components["schemas"]["ScenarioSetGenerate"]): Promise<components["schemas"]["ScenarioSetResponse"]> {
//...
}
import { components } from "okareo-ts-sdk";
// components["schemas"]["ScenarioSetGenerate"]
/** ScenarioSetGenerate */
ScenarioSetGenerate: {
/**
* Project Id
* Format: uuid
* @description ID for the project
*/
project_id?: string;
/**
* Source Scenario Id
* Format: uuid
* @description ID for the scenario set that the generated scenario set will use as a source
*/
source_scenario_id: string;
/**
* Name
* @description Name of the generated scenario set
*/
name: string;
/**
* Number Examples
* @description Number of examples to be generated for the scenario set
*/
number_examples: number;
/**
* @description Type of generation. Current supported scenario types are:<br />
* Seed: Seed data for a scenario set<br />
* Rephrase invariant: Results will be rephrased versions of inputs<br />
* Conditional: Results will be rephrased inputs represented in a conditional format<br />
* Text reverse question: The result will be the target question for the input<br />
* Text reverse label: The result will be the intent of the target question for the input
* @default REPHRASE_INVARIANT
*/
generation_type?: components["schemas"]["ScenarioType"];
/**
* @description Tone to use for scenario generation.
* @default Neutral
*/
generation_tone?: components["schemas"]["GenerationTone"];
};
import { components } from "okareo-ts-sdk";
//components["schemas"]["ScenarioSetResponse"]
/** ScenarioSetResponse */
ScenarioSetResponse: {
/**
* Scenario Id
* Format: uuid
*/
scenario_id: string;
/**
* Project Id
* Format: uuid
*/
project_id: string;
/**
* Time Created
* Format: date-time
*/
time_created: string;
/** Type */
type: string;
/**
* Tags
* @default []
*/
tags?: string[];
/** Name */
name?: string;
/**
* Seed Data
* @default []
*/
seed_data?: components["schemas"]["SeedData"][];
/**
* Scenario Count
* @default 0
*/
scenario_count?: number;
/**
* Scenario Input
* @default []
*/
scenario_input?: string[];
/**
* App Link
* @description This URL links to the Okareo webpage for this scenario set
* @default
*/
app_link?: string;
};
get_all_checks
Return the list of all available checks. The returned list includes both predefined Okareo checks and custom checks uploaded in association with your current organization.
- Usage
- Result
okareo.get_all_checks()
EvaluatorBriefResponse: {
/**
* Id
* Format: uuid
*/
id?: string;
/** Name */
name?: string;
/**
* Description
* @default
*/
description?: string;
/**
* Output Data Type
* @default
*/
output_data_type?: string;
/**
* Time Created
* Format: date-time
*/
time_created?: string;
/** Check Config */
check_config?: Record<string, never>;
};
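Checks are referenced by name when running tests and by ID in get_check, so a common convenience, sketched below with an illustrative check name, is looking a check up by name in the returned list:
const checks = await okareo.get_all_checks();
// Illustrative lookup: resolve a check's ID from its name.
const lengthCheck = checks.find((c: any) => c.name === "output_length_check");
console.log(lengthCheck?.id);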
get_check
Returns a detailed check response object. Useful if you have a check's ID and want to get more information about the check.
- Usage
- Result
okareo.get_check("<UUID-FOR-CHECK>")
EvaluatorDetailedResponse: {
/**
* Id
* Format: uuid
*/
id?: string;
/**
* Project Id
* Format: uuid
*/
project_id?: string;
/** Name */
name?: string;
/**
* Description
* @default
*/
description?: string;
/** Requires Scenario Input */
requires_scenario_input?: boolean;
/** Requires Scenario Result */
requires_scenario_result?: boolean;
/**
* Output Data Type
* @default
*/
output_data_type?: string;
/**
* Code Contents
* @default
*/
code_contents?: string;
/**
* Time Created
* Format: date-time
*/
time_created?: string;
/** Warning */
warning?: string;
/** Check Config */
check_config?: Record<string, never>;
};
run_test
Run a test directly from a registered model. This requires both a registered model and at least one scenario.
The run_test function is called on a registered model in the form model_under_test.run_test(...). If your model requires an API key to call, you will need to pass your key in the model_api_key parameter. Your API keys are not stored by Okareo.
Depending on size and complexity, model runs can take a long time to evaluate. Use scenarios appropriate in size to the task at hand.
- Classification
- Retrieval
- Generation
Read the Classification Overview to learn more about classification evaluations in Okareo.
// Classification evaluations return accuracy, precision, recall, and f1 scores.
const model_under_test = await okareo.register_model(...);
const test_run_response: any = await model_under_test.run_test({
name:"<YOUR_TEST_RUN_NAME>",
tags: [<OPTIONAL_ARRAY_OF_STRING_TAGS>],
project_id: project_id,
scenario_id:"<YOUR_SCENARIO_ID>",
model_api_key: "<YOUR_MODEL_API_KEY>", //Key for OpenAI, Cohere, Pinecone, QDrant, etc.,
calculate_metrics: true,
type: TestRunType.MULTI_CLASS_CLASSIFICATION,
} as RunTestProps);
/*
test_run_response: {
id:str,
project_id:str,
mut_id:str,
scenario_set_id:str,
name:str,
tags:Array[str],
type:'MULTI_CLASS_CLASSIFICATION',
start_time:Date,
end_time: Date,
test_data_point_count:int,
model_metrics: {
'weighted_average': {
'precision': float,
'recall': float,
'f1': float,
'accuracy': float
},
'scores_by_label': {
'label_1': {
'precision': float,
'recall': float,
'f1': float
},
...,
'label_N': {
'precision': float,
'recall': float,
'f1': float
},
}
},
error_matrix: [
{'label_1': [int, ..., int]},
...,
{'label_N': [int, ..., int]}
],
app_link: str
}
*/
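The metrics in the response can be read directly, for example to log or assert on the weighted averages. A small sketch against the shape above:
// Sketch: read the weighted-average metrics off the classification run.
const avg = test_run_response.model_metrics.weighted_average;
console.log(`accuracy=${avg.accuracy} precision=${avg.precision} f1=${avg.f1}`);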
Read the Retrieval Overview to learn more about retrieval evaluations in Okareo.
// Specify retrieval metrics and corresponding K values.
// Below, we use the same k_vals for all available metrics,
// but you can specify any subset of these metrics with
// different sets of K values to evaluate.
const k_max = 5;
const k_vals = [1, 2, 5, 7, 10];
const metrics_kwargs = {
"accuracy_at_k": k_vals,
"precision_recall_at_k": k_vals,
"ndcg_at_k": k_vals,
"mrr_at_k": k_vals,
"map_at_k": k_vals,
}
const model_under_test = await okareo.register_model(...);
const test_run_response: any = await model_under_test.run_test({
name:"<YOUR_TEST_RUN_NAME>",
project_id: project_id,
scenario_id:"<YOUR_SCENARIO_ID>",
type:TestRunType.INFORMATION_RETRIEVAL,
model_api_key: "<YOUR_MODEL_API_KEY>", //Key for OpenAI, Cohere, Pinecone, QDrant, etc.,
metrics_kwargs: metrics_kwargs,
} as RunTestProps);
/*
test_run_response: {
id:str,
project_id:str,
mut_id:str,
scenario_set_id:str,
name:str,
tags:Array[str],
type:'INFORMATION_RETRIEVAL',
start_time:Date,
end_time: Date,
test_data_point_count:int,
model_metrics: {
'Accuracy@k': {'1': float, ..., '5': float},
'Precision@k': {'1': float, ..., '5': float},
'Recall@k': {'1': float, ..., '5': float},
'NDCG@k': {'1': float, ..., '5': float},
'MRR@k': {'1': float, ..., '5': float},
'MAP@k': {'1': float, ..., '5': float},
'row_level_metrics': {
'<UUID-FOR-ROW-1>': {
'1': {'accuracy': float, 'precision': float, 'recall': float, 'mrr': float, 'ndcg': float, 'map': float},
...,
'5': {'accuracy': float, 'precision': float, 'recall': float, 'mrr': float, 'ndcg': float, 'map': float},
},
...,
'<UUID-FOR-ROW-N>': {
'1': {'accuracy': float, 'precision': float, 'recall': float, 'mrr': float, 'ndcg': float, 'map': float},
...,
'5': {'accuracy': float, 'precision': float, 'recall': float, 'mrr': float, 'ndcg': float, 'map': float},
}
}
},
error_matrix: [],
app_link: str
}
*/
To perform evaluations of generative models, you will need to specify your desired checks.
Read the Generation Overview to learn more about generation evaluations in Okareo.
const model_under_test = await okareo.register_model(...);
const test_run_response: any = await model_under_test.run_test({
model_api_key: "<YOUR_MODEL_API_KEY>", //Key for OpenAI, Cohere, Pinecone, QDrant, etc.,
name:"<YOUR_TEST_RUN_NAME>",
tags: [<OPTIONAL_ARRAY_OF_STRING_TAGS>],
project_id: project_id,
scenario_id:"<YOUR_SCENARIO_ID>",
calculate_metrics: true,
checks: ['CHECK_NAME_1', ..., 'CHECK_NAME_N'],
type: TestRunType.NL_GENERATION,
} as RunTestProps);
/*
test_run_response: {
id:str,
project_id:str,
mut_id:str,
scenario_set_id:str,
name:str,
tags:Array[str],
type:'NL_GENERATION',
start_time:Date,
end_time: Date,
test_data_point_count:int,
model_metrics: {
'mean_scores': {
'CHECK_NAME_1' : float,
...,
'CHECK_NAME_N': float,
},
'scores_by_row': [
{
'scenario_index': 1,
'test_id': "UUID-FOR-ROW-1",
'CHECK_NAME_1': float,
...,
'CHECK_NAME_N': float,
},
...,
{
'scenario_index': M,
'test_id': "UUID-FOR-ROW-M",
'CHECK_NAME_1': float,
...,
'CHECK_NAME_N': float,
}
]
},
error_matrix: [],
app_link: str
}
*/
ScenarioType
// import { ScenarioType } from "okareo-ts-sdk";
export declare enum ScenarioType {
COMMON_CONTRACTIONS = "COMMON_CONTRACTIONS",
COMMON_MISSPELLINGS = "COMMON_MISSPELLINGS",
CONDITIONAL = "CONDITIONAL",
LABEL_REVERSE_INVARIANT = "LABEL_REVERSE_INVARIANT",
NAMED_ENTITY_SUBSTITUTION = "NAMED_ENTITY_SUBSTITUTION",
NEGATION = "NEGATION",
REPHRASE_INVARIANT = "REPHRASE_INVARIANT",
ROUNDTRIP_INVARIANT = "ROUNDTRIP_INVARIANT",
SEED = "SEED",
TERM_RELEVANCE_INVARIANT = "TERM_RELEVANCE_INVARIANT",
TEXT_REVERSE_LABELED = "TEXT_REVERSE_LABELED",
TEXT_REVERSE_QUESTION = "TEXT_REVERSE_QUESTION"
}
Okareo has multiple synthetic data generators. We have provided details about each generator type below:
Common Contractions
ScenarioType.COMMON_CONTRACTIONS
Each input in the scenario will be shortened by 1 or 2 characters. For example, if the input is "What is a steering wheel?", the generated input could be "What is a steering whl?".
Common Misspellings
ScenarioType.COMMON_MISSPELLINGS
Common misspellings of the inputs will be generated. For example, if the input is "What is a receipt?", the generated input could be "What is a reciept?".
Conditional
ScenarioType.CONDITIONAL
Each input in the scenario will be rephrased as a conditional statement. For example, if the input is "What are the side effects of this medicine?", the generated input could be "Considering this medicine, what might be the potential side effects?".
Rephrase
ScenarioType.REPHRASE_INVARIANT
Rephrasings of the inputs will be generated. For example, if the input is "Neil Alden Armstrong was an American astronaut and aeronautical engineer who in 1969 became the first person to walk on the Moon", the generated input could be "Neil Alden Armstrong, an American astronaut and aeronautical engineer, made history in 1969 as the first individual to set foot on the Moon."
Reverse Question
ScenarioType.TEXT_REVERSE_QUESTION
Each input in the scenario will be rephrased as a question that the input should be the answer for. For example, if the input is "The first game of baseball was played in 1846.", the generated input could be "When was the first game of baseball ever played?".
Seed
ScenarioType.SEED
The simplest of all generators. It does nothing. A true NoOp.
Term Relevance
ScenarioType.TERM_RELEVANCE_INVARIANT
Each input in the scenario will be rephrased to only include the most relevant terms, where relevance is based on the list of inputs provided to the scenario. Parts of speech are then used to determine a valid ordering of the relevant terms. For example, if the inputs are all names of various milk teas, such as "Cool Sweet Honey Taro Milk Tea with Brown Sugar Boba", the generated input could be "Taro Milk Tea", since "Taro", "Milk", and "Tea" could be the most relevant terms.
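Any of these generators can be supplied as the generation_type when expanding an existing scenario. A sketch using the misspellings generator (the name and source ID are illustrative):
// Sketch: expand an existing scenario with common misspellings.
const misspelled = await okareo.generate_scenario_set({
    project_id: project_id,
    name: "Seed Questions - Misspelled", // illustrative name
    source_scenario_id: seed_scenario_id, // illustrative source scenario
    number_examples: 2,
    generation_type: ScenarioType.COMMON_MISSPELLINGS,
});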
get_scenario_sets
Return one or more scenario sets based on the project_id, or a specific project_id + scenario_id pair.
- Usage
- Details
- Result
import { Okareo, components } from "okareo-ts-sdk";
const okareo = new Okareo({api_key:OKAREO_API_KEY});
const project_id = "YOUR_PROJECT_ID";
const scenario_id = "YOUR_SCENARIO_ID";
const all_scenarios = await okareo.get_scenario_sets({ project_id });
// or
const specific_scenario = await okareo.get_scenario_sets({ project_id, scenario_id });
Takes two arguments, project_id and scenario_id.
project_id: string
scenario_id: string // not required
import { components } from "okareo-ts-sdk";
// components["schemas"]["ScenarioSetResponse"][]
/** ScenarioSetResponse */
ScenarioSetResponse: {
/**
* Scenario Id
* Format: uuid
*/
scenario_id: string;
/**
* Project Id
* Format: uuid
*/
project_id: string;
/**
* Time Created
* Format: date-time
*/
time_created: string;
/** Type */
type: string;
/**
* Tags
* @default []
*/
tags?: string[];
/** Name */
name?: string;
/**
* Seed Data
* @default []
*/
seed_data?: components["schemas"]["SeedData"][];
/**
* Scenario Count
* @default 0
*/
scenario_count?: number;
/**
* Scenario Input
* @default []
*/
scenario_input?: string[];
/**
* App Link
* @description This URL links to the Okareo webpage for this scenario set
* @default
*/
app_link?: string;
/** Warning */
warning?: string;
};
get_scenario_data_points
Return each of the scenario data points in a single scenario set.
- Usage
- Details
- Result
import { Okareo, components } from "okareo-ts-sdk";
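A minimal usage sketch:
const okareo = new Okareo({ api_key: OKAREO_API_KEY });
const scenario_points = await okareo.get_scenario_data_points("<YOUR_SCENARIO_ID>");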
async get_scenario_data_points(scenario_id: string): Promise<components["schemas"]["ScenarioDataPoinResponse"][]> {
//...
}
Takes a single argument, scenario_id.
scenario_id: string
import { components } from "okareo-ts-sdk";
// components["schemas"]["ScenarioDataPoinResponse"]
/** ScenarioDataPoinResponse */
ScenarioDataPoinResponse: {
/**
* Id
* Format: uuid
*/
id: string;
/** Input */
input: Record<string, never> | unknown[] | string;
/** Result */
result: Record<string, never> | unknown[] | string;
/**
* Meta Data
* Format: json-string
*/
meta_data?: string;
}
get_test_run
Return a previously run test. This is useful for "hill-climbing", where you look at a prior run, make changes, and re-run, or when you want to baseline the current run against the last.
- Usage
- Details
- Result
import { Okareo, components } from "okareo-ts-sdk";
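A minimal usage sketch:
const okareo = new Okareo({ api_key: OKAREO_API_KEY });
const prior_run = await okareo.get_test_run("<YOUR_TEST_RUN_ID>");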
async get_test_run(test_run_id: string): Promise<components["schemas"]["TestRunItem"]> {
//...
}
Takes a single argument, test_run_id.
test_run_id: string
import { components } from "okareo-ts-sdk";
// components["schemas"]["TestRunItem"]
/** TestRunItem */
TestRunItem: {
/**
* Id
* Format: uuid
*/
id: string;
/**
* Project Id
* Format: uuid
*/
project_id: string;
/**
* Mut Id
* Format: uuid
*/
mut_id: string;
/**
* Scenario Set Id
* Format: uuid
*/
scenario_set_id: string;
/** Name */
name?: string;
/**
* Tags
* @default []
*/
tags?: string[];
/** Type */
type?: string;
/**
* Start Time
* Format: date-time
*/
start_time?: string;
/**
* End Time
* Format: date-time
*/
end_time?: string;
/** Test Data Point Count */
test_data_point_count?: number;
/** Model Metrics */
model_metrics?: Record<string, never>;
/** Error Matrix */
error_matrix?: unknown[];
/**
* App Link
* @description This URL links to the Okareo webpage for this test run
* @default
*/
app_link?: string;
};
register_model
Register the model that you want to evaluate, test or collect datapoints from. Models must be uniquely named within a project namespace.
In order to run a test, you will need to register a model. The first time a model is registered, its attributes are persisted. If you register a model with the same name again, the existing model is returned; the definition is only updated if the update: true flag is passed.
- Usage
- Details
- Result
import { Okareo, CustomModel } from "okareo-ts-sdk";
const okareo = new Okareo({api_key:OKAREO_API_KEY});
const model_under_test = await okareo.register_model({
name: "Example Custom Model",
tags: ["Custom", "End-2-End"],
project_id: project_id,
models: {
type: "custom",
invoke: (input: string) => {
return {
actual: "Technical Support",
model_response: {
input: input,
method: "hard coded",
context: "Example context response",
}
}
}
} as CustomModel
});
The passed properties for the register function are based on the type of model being used. See the model types for more information.
async register_model(props: any): Promise<components["schemas"]["ModelUnderTestResponse"]> {
//...
}
interface BaseModel {
type?: string | undefined;
tags?: string[] | undefined;
}
export interface TCustomModel extends BaseModel {
invoke: Function;
}
export interface TCustomModelResponse {
actual: any | string;
response: any | string;
}
// components["schemas"]["ModelUnderTestResponse"]
/** ModelUnderTestResponse */
ModelUnderTestResponse: {
/**
* Id
* Format: uuid
*/
id: string;
/**
* Project Id
* Format: uuid
*/
project_id: string;
/** Name */
name: string;
/** Tags */
tags: string[];
/** Time Created */
time_created: string;
/** Datapoint Count */
datapoint_count?: number;
/**
* App Link
* @description This URL links to the Okareo webpage for this model
* @default
*/
app_link?: string;
};
Okareo has ready-to-run integrations with the following models and vector databases. Don't hesitate to reach out if you need another model.
OpenAI (LLM)
import { OpenAIModel } from 'okareo-ts-sdk';
interface OpenAIModel extends BaseModel {
type: "openai";
model_id: string;
temperature: number;
system_prompt_template: string;
user_prompt_template: string;
dialog_template: string;
tools?: unknown[];
}
Generation Model (LLM)
import { GenerationModel } from 'okareo-ts-sdk';
interface GenerationModel extends BaseModel {
type: "generation";
model_id: string;
temperature: number;
system_prompt_template: string;
user_prompt_template: string;
dialog_template: string;
tools?: unknown[];
}
The GenerationModel is a universal LLM interface that supports most model providers. Users can plug in different model names, including OpenAI, Anthropic, and Cohere models.
Example using Cohere model with GenerationModel:
import { GenerationModel } from 'okareo-ts-sdk';
const cohereModel: GenerationModel = {
type: "generation",
model_id: "command-r",
temperature: 0.7,
system_prompt_template: "You are a helpful assistant.",
};
Example with tools:
import { GenerationModel } from 'okareo-ts-sdk';
const tools = [
{
type: "function",
function: {
name: "get_current_weather",
description: "Get the current weather in a given location",
parameters: {
type: "object",
properties: {
location: {
type: "string",
description: "The city and state, e.g. San Francisco, CA"
},
unit: {
type: "string",
enum: ["celsius", "fahrenheit"]
}
},
required: ["location"]
}
}
}
];
const modelWithTools: GenerationModel = {
type: "generation",
model_id: "gpt-3.5-turbo-0613",
temperature: 0.7,
system_prompt_template: "You are a helpful assistant with access to weather information.",
tools: tools
};
In these examples, we're using the Cohere "command-r" model and the OpenAI "gpt-3.5-turbo-0613" model through the GenerationModel interface. The second example demonstrates how to include tools, which can be used for function calling capabilities.
Pinecone (VectorDB)
//import { TPineconeDB } from "okareo-ts-sdk";
export interface TPineconeDB extends BaseModel {
type?: string | undefined; // from BaseModel
tags?: string[] | undefined; // from BaseModel
index_name: string;
region: string;
project_id: string;
top_k: string;
}
QDrant (VectorDB)
//import { TQDrant } from "okareo-ts-sdk";
export interface TQDrant extends BaseModel {
type?: string | undefined; // from BaseModel
tags?: string[] | undefined; // from BaseModel
collection_name: string;
url: string;
top_k: string;
}
Custom Model
You can use the CustomModel object to define your own custom, provider-agnostic models.
//import { TCustomModel } from "okareo-ts-sdk";
export interface TCustomModel extends BaseModel {
invoke(input: string): {
actual: any | string;
model_response: {
input: any | string;
method: any | string;
context: any | string;
}
};
}
To use the CustomModel object, you will need to implement an invoke method that returns a ModelInvocation object. For example,
import { CustomModel, ModelInvocation } from "okareo-ts-sdk";
const my_custom_model: CustomModel = {
type: "custom",
invoke: (input: string) => {
// your model's invoke logic goes here
return {
model_prediction: ...,
model_input: input,
model_output_metadata: {
prediction: ...,
other_data_1: ...,
other_data_2: ...,
...,
},
tool_calls: ...
} as ModelInvocation
}
}
Where the ModelInvocation's inputs are defined as follows:
export interface ModelInvocation {
/**
* Prediction from the model to be used when running the evaluation,
* e.g. predicted class from classification model or generated text completion from
* a generative model. This would typically be parsed out of the overall model_output_metadata
*/
model_prediction?: Record<string, any> | unknown[] | string;
/**
* All the input sent to the model
*/
model_input?: Record<string, any> | unknown[] | string;
/**
* Full model response, including any metadata returned with model's output
*/
model_output_metadata?: Record<string, any> | unknown[] | string;
/**
* List of tool calls made during the model invocation, if any
*/
tool_calls?: any[];
}
The logic of your invoke method depends on many factors, chief among them the intended TestRunType of the CustomModel. Below, we highlight an example of how to use CustomModel for each TestRunType in Okareo.
- Classification
- Retrieval
- Generation
The following CustomModel classification example is taken from the custommodel.test.ts script. This model always returns "Technical Support" as the model_prediction.
const classificationModel: CustomModel = {
    type: "custom",
    invoke: (input: string) => {
        return {
            model_prediction: "Technical Support",
            model_input: input,
            model_output_metadata: {
                input: input,
                method: "hard coded",
                context: "Example context"
            }
        } as ModelInvocation
    }
};
Okareo natively supports Pinecone and QDrant models for retrieval. If you want to utilize a different model provider or database, you can use CustomModel to do so.
The following CustomModel retrieval example is taken from the custommodel.test.ts script. This example assigns random scores to a random subset of articleIds and returns the (id, score) pairs as the model's prediction.
import { CustomModel, ModelInvocation } from "okareo-ts-sdk";
const retrievalModel: CustomModel = {
    type: "custom",
    invoke: (input: string) => {
        const articleIds = ["Spring Saver", "Free Shipping", "Birthday Gift", "Super Sunday", "Top 10", "New Arrivals", "January", "July"];
        const scores = Array.from({ length: 5 }, () => ({
            id: articleIds[Math.floor(Math.random() * articleIds.length)], // Select a random ID for each score
            score: parseFloat(Math.random().toFixed(2)) // Generate a random score
        })).sort((a, b) => b.score - a.score); // Sort based on the score
        const parsedIdsWithScores = scores.map(({ id, score }) => [id, score]);
        return {
            model_prediction: parsedIdsWithScores,
            model_input: input,
            model_output_metadata: {
                input: input,
            }
        } as ModelInvocation
    }
};
Okareo natively supports most model providers for generation through GenerationModel. If you want to utilize a different model provider or endpoint, you can use CustomModel to do so.
The following snippet makes a POST request to a generic model provider that can be accessed via an API.
// API key from your desired model provider
const API_KEY = "<YOUR_API_KEY>";
// URL for the API endpoint that calls your model
const MODEL_URL = "<YOUR_MODEL_URL>";

const generationModel: CustomModel = {
    type: "custom",
    invoke: async (input: string) => {
        // format the input as messages as required by the API
        // here we assume messages are sent to the model as a list
        // i.e., [{'role': ..., 'content': ...}, ...]
        const messages = [{ "user": input }];
        const payload = {
            messages: messages
        };
        const headers = {
            "accept": "application/json",
            "content-type": "application/json",
            "Authorization": `Bearer ${API_KEY}`
        };
        const response = await fetch(MODEL_URL, {
            method: 'POST',
            headers: headers,
            body: JSON.stringify(payload)
        });
        const fullModelOutput = await response.json();
        const generatedResponse = fullModelOutput.messages[fullModelOutput.messages.length - 1].content;
        return {
            model_prediction: generatedResponse,
            model_input: input,
            model_output_metadata: fullModelOutput,
            tool_calls: ...,
        } as ModelInvocation
    }
};
MultiTurnDriver
A MultiTurnDriver allows you to evaluate a language model over the course of a full conversation. The MultiTurnDriver is made up of two pieces: a Driver and a Target.
The Driver is defined in your MultiTurnDriver, while your Target is defined as either a CustomMultiturnTarget or a GenerationModel.
// import { MultiTurnDriver, StopConfig } from "okareo-ts-sdk"
export interface MultiTurnDriver extends BaseModel {
    type: "driver";
    target: GenerationModel | CustomMultiturnTarget;
    driver_temperature?: number; // default 0.8
    max_turns?: bigint; // default 5
    repeats?: bigint; // default 1
    first_turn?: string; // default "target"
    stop_check: StopConfig;
}
Driver
The possible parameters for the Driver are:
driver_temperature: number = 1.0
max_turns: bigint = 5
repeats: bigint = 1
first_turn: string = "target"
stop_check: StopConfig
driver_temperature defines the temperature used in the model that simulates a user.
max_turns defines the maximum number of back-and-forth interactions that can occur in the conversation.
repeats defines how many times each row in a scenario will be run when a model is run with run_test. Since the Driver is non-deterministic, repeating the same row of a scenario can lead to different conversations.
first_turn defines whether the Target or the Driver sends the first message in the conversation.
stop_check defines when the conversation should stop. It requires a check name and a boolean value defining whether the conversation stops on a true or a false value returned from the check.
Target
A Target is either a GenerationModel or a CustomMultiturnTarget. Refer to GenerationModel for details on GenerationModel.
The only exception to the standard usage is that a system_prompt_template is required when using a MultiTurnDriver. The system_prompt_template defines the system prompt for how the Target should behave.
A CustomMultiturnTarget is defined in largely the same way as a CustomModel. The key difference is that the input is a list of messages in OpenAI's message format, as sketched below.
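A sketch of a CustomMultiturnTarget under that message-format assumption; the type tag and reply logic are illustrative stand-ins, not the SDK's exact contract:
// Sketch: a target that always answers the latest user message.
const customTarget = {
    type: "custom_target", // assumed tag; check the SDK typings for the exact value
    invoke: (messages: { role: string; content: string }[]) => {
        const lastUserMessage = messages[messages.length - 1].content;
        return {
            model_prediction: `You asked: ${lastUserMessage}`, // stand-in reply
            model_input: messages,
            model_output_metadata: {},
        } as ModelInvocation;
    }
} as CustomMultiturnTarget;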
Driver and Target Interaction
The Driver simulates user behavior, while the Target represents the AI model being tested. This setup allows for testing complex scenarios and evaluating the model's performance over extended conversations.
Setting up a scenario
Scenarios in MultiTurnDriver are crafted using SeedData, where the input field serves as a driver prompt. The driver prompt instructs the simulated user (Driver) on how to behave throughout the conversation, including specific questions to ask, responses to give, and even how to react to the model's function calls. This creates a controlled yet dynamic testing environment for evaluating the model's performance across realistic interaction patterns.
const seedData: SeedData[] = [
{
input: "You are interacting with a customer service agent. First, ask about WebBizz...",
result: "N/A",
},
// ... more seed data
];
Tools and Function Calling
The Target model can be equipped with tools, which are essentially functions the model can call. For instance:
const tools = [
{
type: "function",
function: {
name: "delete_account",
description: "Deletes the user's account",
// ... parameter details
},
}
];
These tools allow the model to perform specific actions, like deleting a user account in this case.
Mocking Tool Results
The driver prompt can be used to mock the results of tool calls. This is crucial for testing how the model responds to different outcomes without actually performing the actions. For example:
const input = `... If you receive any function calls, output the result in JSON format
and provide a JSON response indicating that the deletion was successful.`;
This prompt instructs the Driver to simulate a successful account deletion when the function is called.
Checks and Conversation Control
Checks are used to evaluate specific aspects of the conversation or to control its flow. For instance:
const stopCheck: StopConfig = {
check_name: "task_completion_delete_account",
stop_on: true,
};
This configuration stops the conversation when the account deletion task is completed.
Custom checks can be created to evaluate various aspects of the conversation:
okareo.create_or_update_check({
    name: 'task_completion_delete_account',
    description: "Check if the agent confirms account deletion",
    check_config: {
        type: CheckOutputType.PASS_FAIL,
        prompt_template: "..." // the judge prompt describing the desired behavior
    }
});
These checks can assess task completion, adherence to guidelines, or any other relevant criteria.
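Tying the pieces together, the following sketch registers a MultiTurnDriver whose Target is a GenerationModel and whose stop_check is the configuration above; the names and values are illustrative:
// Sketch: register a MultiTurnDriver with a GenerationModel Target.
const driver_model = await okareo.register_model({
    name: "Customer Service MultiTurn", // illustrative name
    project_id: project_id,
    models: {
        type: "driver",
        driver_temperature: 0.8,
        max_turns: 5,
        repeats: 1,
        first_turn: "target",
        stop_check: stopCheck, // defined above
        target: {
            type: "generation",
            model_id: "gpt-3.5-turbo",
            temperature: 0,
            system_prompt_template: "You are a customer service agent for WebBizz.", // required for Targets
        } as GenerationModel,
    } as MultiTurnDriver,
});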
upload_scenario_set
Batch upload JSONL-formatted data to create a scenario set. This is the most efficient method for pushing large data sets for tests and evaluations. (A sample file is sketched at the end of this section.)
- Usage
- Details
- Result
import { Okareo } from "okareo-ts-sdk";
const okareo = new Okareo({api_key:OKAREO_API_KEY});
const data: any = await okareo.upload_scenario_set(
{
file_path: "example_data/seed_data.jsonl",
scenario_name: "Uploaded Scenario Set",
project_id: project_id
}
);
Takes a single argument, UploadScenarioSetProps, which includes the project_id, scenario_name, and file_path.
async upload_scenario_set(props: UploadScenarioSetProps): Promise<components["schemas"]["ScenarioSetResponse"]> {
//...
}
export interface UploadScenarioSetProps {
project_id: string;
scenario_name: string;
file_path: string;
}
/** ScenarioSetResponse */
ScenarioSetResponse: {
/**
* Scenario Id
* Format: uuid
*/
scenario_id: string;
/**
* Project Id
* Format: uuid
*/
project_id: string;
/**
* Time Created
* Format: date-time
*/
time_created: string;
/** Type */
type: string;
/**
* Tags
* @default []
*/
tags?: string[];
/** Name */
name?: string;
/**
* Seed Data
* @default []
*/
seed_data?: components["schemas"]["SeedData"][];
/**
* Scenario Count
* @default 0
*/
scenario_count?: number;
/**
* Scenario Input
* @default []
*/
scenario_input?: string[];
/**
* App Link
* @description This URL links to the Okareo webpage for this scenario set
* @default
*/
app_link?: string;
};
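Each line of the uploaded .jsonl file is one seed row. A sketch of what example_data/seed_data.jsonl might contain, assuming the standard input/result seed shape:
{"input": "Example input to be sent to the model", "result": "Expected result from the model"}
{"input": "Another example input", "result": "Another expected result"}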
Reporters
The Okareo TypeScript SDK includes a set of reporters. Reporters allow you to get rapid feedback in CI or locally from the command line.
Reporters are convenience functions that interpret evaluations based on thresholds that you provide. Reporters are not persisted and do not alter the evaluation; they are simply conveniences for rapid summarization locally and in CI.
Singleton Evaluation Reporters
There are two categories of reporters. The singleton reporters are based on specific evaluation types and can report on each. You can set thresholds specific to classification, retrieval, or generation and the reporters will provide detailed pass/fail information. The second category provides trend information. The history reporter takes a list of evaluations along with a threshold instance and returns a table of results over time.
Class ClassificationReporter
The classification reporter takes the evaluated metrics and the confusion matrix and returns a pass/fail, a count of errors, and the specific metrics that fail.
By convention we define the reporter thresholds independently. This way we can re-use them in trend analysis and across evaluations.
Example input and response for a passing test:
- Input
- Response
import { ClassificationReporter } from "okareo-ts-sdk";
/*
... body of evaluation
*/
const eval_thresholds = {
error_max: 8,
metrics_min: {
precision: 0.95,
recall: 0.9,
f1: 0.9,
accuracy: 0.95
}
}
const reporter = new ClassificationReporter({
eval_run:classification_run,
...eval_thresholds,
});
reporter.log(); //provides a table of results
/*
// do something if it fails
if (!reporter.pass) { ... }
*/
interface ClassificationReporterResponse {
pass: boolean;
errors: number;
fail_metrics: {
min: {
[key: string]: {
metric: string,
value: number,
expected: number,
}
}
}
}
Response Example
/* Success */
{
pass: true,
errors: 0,
fail_metrics: { }
}
/* Failure */
{
pass: false,
errors: 6,
fail_metrics: {
precision: { metric: 'precision', value: 0.75, expected: 0.95 },
f1: { metric: 'f1', value: 0.7333333333333333, expected: 0.9 }
}
}
Class RetrievalReporter
The retrieval reporter provides a shortcut for metrics @k. Each metric can reference a different k value. The result of the report is always in summary form and only returns the metrics that miss their thresholds.
Example input and response for a failing test:
- Input
- Response
import { retrieval_reporter } from "okareo-ts-sdk";
/*
... body of evaluation
*/
const report = retrieval_reporter(
{
eval_run:data, // data from a retrieval run
metrics_min: {
'Accuracy@k': {
value: 0.96,
at_k: 3
},
'Precision@k': {
value: 0.5,
at_k: 1 // can use different k values by metric
},
'Recall@k': {
value: 0.8,
at_k: 2 // can use different k values by metric
},
'NDCG@k': {
value: 0.2,
at_k: 3
},
'MRR@k': {
value: 0.96,
at_k: 3
},
'MAP@k': {
value: 0.96,
at_k: 3
}
}
}
);
expect(report.pass).toBeTruthy(); // example report assertion
interface RetrievalReporterResponse {
pass: boolean;
errors: number;
fail_metrics: {
min: {
[key: string]: {
metric: string,
value: number,
expected: number,
k: number;
}
}
}
}
Response Example
/* Success */
{
pass: true,
errors: 0,
fail_metrics: { }
}
/* Failure */
{
pass: false,
errors: 52,
fail_metrics: {
'MRR@k': {
metric: 'MRR@k',
k: 3,
value: 0.8833333333333332,
expected: 0.99
},
'MAP@k': {
metric: 'MAP@k',
k: 3,
value: 0.8833333333333332,
expected: 0.99
}
}
}
Class GenerationReporter
The generation reporter takes an arbitrary list of metric name:value pairs and reports on results that did not meet the minimum threshold you define. Often these metrics are unique to your circumstance. Boolean values are treated as 0 or 1.
Example input and response for a failing test:
- Input
- Response
import { generation_reporter } from "okareo-ts-sdk";
/*
... body of evaluation
*/
const report = generation_reporter(
{
eval_run:data,
metrics_min: {
coherence: 4.9,
consistency: 3.2,
fluency: 4.7,
relevance: 4.3,
overall: 4.1
}
}
);
expect(report.pass).toBeTruthy(); // example report assertion
interface GenerationReporterResponse {
pass: boolean;
errors: number;
fail_metrics: {
min: {
[key: string]: {
metric: string,
value: number,
expected: number,
}
},
max: {
[key: string]: {
metric: string,
value: number,
expected: number,
}
},
pass_rate: {
[key: string]: {
metric: string,
value: number,
expected: number,
}
}
}
}
Response Example
/* Success */
{
pass: true,
errors: 0,
fail_metrics: { }
}
/* Failure */
{
pass: false,
errors: 24,
fail_metrics: {
coherence: { metric: 'coherence', value: 3.6041134314518946, expected: 4.9 },
fluency: { metric: 'fluency', value: 2.0248845922814245, expected: 4.7 }
}
}
History Reporter
The second category of reporter provides historical information based on a series of test runs. Like the singletons, each reporter analyzes a single evaluation type at a time. However the mechanism is shared across all types.
Class EvaluationHistoryReporter
The EvaluationHistoryReporter requires four inputs: the evaluation type, the list of evals, the assertions, and the number of runs to render. The type must be one of the Okareo TestRunType definitions. The assertions are shared with the singleton reporters.
By convention, we define the reporter thresholds independently; being able to re-use them between singleton reports and historical reports is one of the many reasons.
Classification Report
Retrieval Report
Generation Report
- Usage
const history_class = new EvaluationHistoryReporter({
type: TestRunType.MULTI_CLASS_CLASSIFICATION,
evals:[TEST_RUN_CLASSIFICATION as components["schemas"]["TestRunItem"], TEST_RUN_CLASSIFICATION as components["schemas"]["TestRunItem"]],
assertions: class_metrics,
last_n: 5,
});
history_class.log();
Exporting Reports for CI
Class JSONReporter
When using Okareo as part of a CI run, it is useful to export evaluations into a common location that can be picked up by the CI analytics.
By using JSONReporter.log([eval_run, ...]) after each evaluation, Okareo will collect the JSON results in ./.okareo/reports. The location can be controlled as part of the CLI with the -r LOCATION or --report LOCATION parameters. The output JSON is useful in CI for historical reference.
JSONReporter.log([eval_run, ...]) will output to the console unless the evaluation is initiated by the CLI.
- Usage
import { JSONReporter } from 'okareo-ts-sdk';
const reporter = new JSONReporter([eval_run]);
reporter.log();