Okareo Python SDK
Okareo's Python SDK helps you track errors and evaluate large language models (LLMs), agents, and RAG pipelines in both structured test scenarios and real-world usage. It provides a unified way to collect model telemetry (inputs, outputs, errors) and run evaluations to measure performance and reliability.
Use the Python SDK to integrate Okareo's cloud-based or self-hosted service into your applications. The SDK lets you register and test your AI code (including LLMs, agents, and embedding models), define scenarios for evaluation (single-turn or multi-turn conversations, classification, and retrieval), log real-time datapoints from your app, and use built-in or custom metrics (Checks) to evaluate performance.
Install and Authenticate
Installing the SDK: Okareo is available on PyPI. You can install it with pip as follows:
pip install okareo
After installing, you'll need to obtain an API token from Okareo. Sign up for a free account on the Okareo web app and generate an API Token. Set this token as an environment variable so the SDK can authenticate:
export OKAREO_API_KEY="<YOUR_TOKEN>"
Alternatively, you can pass the API key directly in code when initializing the client (see below).
Authenticating the Okareo client: In your Python code, import the Okareo class and create an instance, providing your API key. For example:
from okareo import Okareo
okareo = Okareo("YOUR API TOKEN")
This creates a client that communicates with Okareo's cloud API. You can either set the OKAREO_API_KEY environment variable or pass the key directly as a parameter.
Verify
You can verify your installation and API connectivity with this simple test:
import os
from okareo import Okareo
# Initialize client (using environment variable)
okareo = Okareo(os.environ["OKAREO_API_KEY"])
print(f"✅ Installation verified! Projects for this account: {okareo.get_projects()}")
Using the Okareo SDK
Once the Okareo client is authenticated, you can use the SDK to register models/agents, define evaluation scenarios, run tests, and log data. Below are common usage patterns with code examples.
Defining Scenarios for Evaluation
A Scenario Set in Okareo represents a dataset or set of test cases (inputs and expected outputs) against which to evaluate your application. You can create scenarios programmatically using the SDK:
Prepare seed data: Each scenario data point consists of an input (e.g. a prompt or query) and an expected result (e.g. the correct answer or ideal response). The SDK provides a SeedData model class to structure these. For convenience, you can use a Python list of dictionaries and convert it to SeedData objects using Okareo.seed_data_from_list(...). For example:
from okareo import Okareo
from okareo_api_client.models import ScenarioSetCreate, SeedData
# Define a list of test cases as dicts
seed_items = [
{"input": "Capital of France?", "result": "Paris"},
{"input": "5 + 7 =", "result": "12"}
]
# Convert to SeedData objects
seed_data = Okareo.seed_data_from_list(seed_items)
(The above returns a list of okareo_api_client.models.SeedData objects – this class is part of the SDK and defines the schema for inputs/results; the seed_data_from_list helper does the conversion for you.)
Create a scenario set: Once you have a list of SeedData, call create_scenario_set on the Okareo client. You'll pass a ScenarioSetCreate object with a name and the seed data list:
from okareo_api_client.models import ScenarioSetCreate
scenario_req = ScenarioSetCreate(
name="My Evaluation Scenario",
seed_data=seed_data
)
scenario_set = okareo.create_scenario_set(scenario_req)
print(scenario_set.app_link)
In this example, scenario_set will be a ScenarioSetResponse containing a unique scenario_id that identifies this set and an app_link – a link to the newly created scenario in the Okareo web app. You can create scenario sets for different purposes – e.g. a set of questions for a Question/Answer model task, or a set of conversation turns for a chatbot. Okareo allows using these scenarios both for driving evaluations and as seeds for synthetic scenario generation (more on generation later).
Running Evaluations – GenerationModel
GenerationModel provides a standard interface to evaluate text-generation models (LLMs) in Okareo. You define the model by specifying its identifier and parameters (like temperature), register it with Okareo, create a scenario set of test prompts and expected results, then run an evaluation:
import os
from okareo import Okareo
from okareo.model_under_test import GenerationModel
from okareo_api_client.models import ScenarioSetCreate # class for scenario set creation
from okareo_api_client.models import TestRunType # types of evaluations that could be run (Classification/Retrieval/Text Generation)
okareo = Okareo(os.environ["OKAREO_API_KEY"])
# 1. Define a generation model (e.g., LLM with a given model ID and parameters)
model = GenerationModel(model_id="gpt-4o",
temperature=0.7,
system_prompt_template="{input}") # in this template {input} pulls from scenario set
# 2. Register the model with Okareo (returns a ModelUnderTest instance)
mut = okareo.register_model(name="LLM under evaluation", model=model) # ModelUnderTest for the registered model
# 3. Create a scenario set with input-output pairs for evaluation
seed_data = Okareo.seed_data_from_list([
{"input": "What is 2+2?", "result": "4"},
{"input": "Hello", "result": "Hi"}
]) # Convert list of dicts to SeedData objects
scenario_request = ScenarioSetCreate(name="Basic Q&A", seed_data=seed_data)
scenario_set = okareo.create_scenario_set(scenario_request) # Create scenario set in Okareo
# 4. Run an evaluation on the scenario set
test_run = mut.run_test(scenario=scenario_set,
name="Example Eval Run",
api_key=os.environ["OPENAI_API_KEY"],
test_run_type=TestRunType.NL_GENERATION, # text-generation evaluation
checks=["reference_similarity"])
print(test_run.app_link)
In the above example, we define a generation model with a model_id and temperature (you can also set prompt templates if needed). We register it via register_model() – providing a name and the GenerationModel object – which returns a ModelUnderTest handle tied to that model. We then prepare a Scenario Set of two simple Q&A pairs. The helper Okareo.seed_data_from_list() converts a list of {"input": ..., "result": ...} mappings into the required SeedData format. Using okareo.create_scenario_set(...), we upload this scenario set to the Okareo platform. Finally, calling mut.run_test(...) executes the model on each scenario in the set and returns a TestRunItem result (which includes metrics and outcomes). This allows you to automatically evaluate the model's responses against expected results.
Obtaining results: run_test typically blocks until the evaluation is complete (the HTTP call returns when the run is done). You can also log in to the Okareo web app – each test run is visible there with detailed reports (the TestRunItem.app_link field gives a direct URL to view the run in the app).
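For a quick sanity check you can print a couple of fields from the returned run object – app_link is the field used throughout this guide, while other TestRunItem attributes may vary by SDK version:
# Quick look at the completed run
print(test_run.name)      # the name passed to run_test (assumed attribute)
print(test_run.app_link)  # direct URL to the run in the Okareo web app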
Okareo supports many providers through GenerationModel (OpenAI, Cohere, Anthropic, etc.). The register_model() call returns a model handle we'll use for running tests. (Model names must be unique within your project.)
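For instance, switching providers is mostly a matter of changing the model_id and the API key passed to run_test. The model ID below is an assumption, so substitute one your account can access:
# Hypothetical example: an Anthropic model via GenerationModel
claude_model = GenerationModel(
    model_id="claude-3-haiku-20240307",  # assumption: use any model ID your provider offers
    temperature=0.2,
    system_prompt_template="{input}",
)
claude_mut = okareo.register_model(name="Claude under evaluation", model=claude_model)
# When running tests, pass the provider's key, e.g. api_key=os.environ["ANTHROPIC_API_KEY"]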
Running Evaluations – CustomModel and ModelInvocation
Okareo also supports evaluating custom models – models not covered by built-in providers – by subclassing CustomModel. You implement the invoke() method to call your model and return a ModelInvocation result. This lets you integrate any arbitrary model (e.g. from Hugging Face) or custom logic into the Okareo evaluation framework:
from okareo.model_under_test import CustomModel, ModelInvocation
# Define a custom model by subclassing CustomModel
class MyModel(CustomModel):
def __init__(self):
super().__init__(name="MyModel") # set a name for the custom model
def invoke(self, input_value):
# Simple echo model: returns the input as the "prediction"
return ModelInvocation(
model_prediction=input_value,
model_input=input_value
) # wrap the output and input in a ModelInvocation object
# Register the custom model
model = MyModel()
mut = okareo.register_model(name="Custom Model", model=model)
# Run an evaluation on the custom model, reusing the scenario_set defined earlier
test_run = mut.run_test(scenario=scenario_set,
name="Custom Test",
test_run_type=TestRunType.NL_GENERATION, # text-generation evaluation
checks=["reference_similarity"])
In this example, MyModel inherits from CustomModel and implements invoke(self, input_value). Inside invoke, we call our actual model logic – here we simply echo the input – and return a ModelInvocation object containing the prediction and any relevant metadata. The Okareo SDK uses this to handle custom model outputs uniformly. We then register MyModel with register_model() (just like a built-in model) and run a test on a given scenario set. The process for creating or reusing scenario sets and executing run_test is identical, making custom models first-class citizens in the evaluation workflow. This way, you can evaluate proprietary models or algorithms with the same scenario-based approach and metrics as other models.
Classification
The Okareo Python SDK makes it easy to set up and evaluate classification tasks—where the goal is to assign input text to one of several predefined categories. A common use case is intent detection in Retrieval-Augmented Generation (RAG) systems, where identifying the user’s intent helps retrieve more relevant context and improves response quality.
To define a classification task, you provide instructions that describe the input format, output format, and list of valid categories—like "Support," "Returns," "Membership," or "Sustainability." You can run these tasks with pre-trained models or fine-tune your own using the SDK to get better results on your specific data. This lets you create highly specialized classification models tailored to your application's needs.
The SDK also includes built-in tools to evaluate classification accuracy and performance directly in the Okareo platform.
Here’s a minimal example to get started:
import os
from okareo import Okareo
from okareo.model_under_test import CustomModel, ModelInvocation
from okareo_api_client.models import ScenarioSetCreate, TestRunType
okareo = Okareo(os.environ["OKAREO_API_KEY"])
# 1. Define a classification model that returns a label for each input
# (we subclass CustomModel as a simple example, but this could be a GenerationModel, a Hugging Face model, etc.)
class MyClassifier(CustomModel):
def invoke(self, input_text: str):
# Simple rule-based classifier for illustration
if "return" in input_text.lower():
result_label = "returns"
elif "how much" in input_text.lower():
result_label = "pricing"
else:
result_label = "complaints"
return ModelInvocation(
model_prediction=result_label,
model_input=input_text,
model_output_metadata={"extracted_entities": {"return": True, "amount": 23}} # example use is query extraction,etc
)
# 2. Register the custom classification model with Okareo
mut = okareo.register_model(name="Intent Classifier", model=MyClassifier(name="ClassifierModel"))
# 3. Create a scenario set with input texts and expected labels
seed_data = Okareo.seed_data_from_list([
{"input": "I want to send this product back", "result": "returns"},
{"input": "How much does this product cost?", "result": "pricing"}
])
scenario_set = okareo.create_scenario_set(ScenarioSetCreate(name="Intent Test Set", seed_data=seed_data))
# 4. Run the classification evaluation on the scenario set
test_run = mut.run_test(scenario=scenario_set,
name="Classification Test Run",
test_run_type=TestRunType.MULTI_CLASS_CLASSIFICATION)
print(test_run.app_link)
In the above example, we define a simple classification model by subclassing CustomModel and implementing its invoke method to return a category label for each input. We register this custom model and set up a scenario set with some example inputs and their expected labels. Finally, we call mut.run_test(...) with the scenario set to evaluate the classifier's predictions against the expected results. Okareo will automatically compute classification metrics (accuracy, F1, recall, precision) and a confusion matrix for the run. The returned TestRunItem (test_run) contains these metrics and an app_link to view the detailed results in the Okareo web app.
Retrieval
The SDK lets you build and evaluate retrieval systems—used to perform nearest neighbor search on a collection of vectors and fetch relevant context from a dataset based on a query. A typical use case is comparing sparse (like SPLADE) vs. dense (like e5) vs. hybrid embedding methods to see which performs better on your data.
You can bring your own embedding and reranker models to better fit your domain—and evaluate them directly in the SDK. Okareo also lets you fine-tune these models to get the best performance on your data.
Besides flexibility in embedding models, you can leverage Okareo's standard connectors for Qdrant, Pinecone, or any vector DB/store of your choice, or use a CustomModel to wrap a vector DB like ChromaDB (as in the example below). Okareo evaluates retrievers with Information Retrieval metrics like Accuracy@k, Precision@k, Recall@k, MAP@k, MRR@k, and NDCG@k.
import os
import chromadb
from okareo import Okareo
from okareo.model_under_test import CustomModel, ModelInvocation
from okareo_api_client.models import ScenarioSetCreate, TestRunType
okareo = Okareo(os.environ["OKAREO_API_KEY"])
# Set up vector DB
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="retrieval_test", metadata={"hnsw:space": "cosine"}, get_or_create=True)
# !!! Add documents to the ChromaDB collection before running. !!!
# Define model
class RetrievalModel(CustomModel):
# A function to convert the query results from our ChromaDB collection
# into a list of dictionaries with the document ID, score, metadata, and label
@staticmethod
def query_results_to_score(results):
parsed_ids_with_scores = []
for i in range(0, len(results['distances'][0])):
# Create a score based on cosine similarity
score = (2 - results['distances'][0][i]) / 2
parsed_ids_with_scores.append(
{
"id": results['ids'][0][i],
"score": score,
"metadata": results['documents'][0][i],
"label": f"Doc w/ ID: {results['ids'][0][i]}"
}
)
return parsed_ids_with_scores
def invoke(self, input: str) -> ModelInvocation:
results = collection.query(query_texts=[input], n_results=5)
# Return formatted query results and the model response context
return ModelInvocation(model_prediction=RetrievalModel.query_results_to_score(results), model_output_metadata={'model_data': input})
# Register model
mut = okareo.register_model(name="vectordb_retrieval_test", model=RetrievalModel(name="retrieval"))
# Create scenario set
seed_data = Okareo.seed_data_from_list([
{"input": "What is Okareo?", "result": ["doc123"]},
{"input": "How do I install Okareo?", "result": ["doc456"]}
])
scenario = okareo.create_scenario_set(ScenarioSetCreate(name="Retrieval Test", seed_data=seed_data))
# Define thresholds for the evaluation metrics
at_k_intervals = [1, 3, 5]
# Run test
test_run = mut.run_test(
scenario=scenario,
name="Retrieval Run",
test_run_type=TestRunType.INFORMATION_RETRIEVAL,
# Define the evaluation metrics to calculate
metrics_kwargs={
"accuracy_at_k": at_k_intervals ,
"precision_recall_at_k": at_k_intervals ,
"ndcg_at_k": at_k_intervals,
}
)
print(test_run.app_link)
Use 'metrics_kwargs' to customize @k intervals.
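For example, here is a variation of the run above with wider cutoffs for a larger collection, reusing only the metric keys already shown:
# Wider @k cutoffs for a larger collection (same keys as the example above)
at_k_intervals = [1, 5, 10, 20]
test_run = mut.run_test(
    scenario=scenario,
    name="Retrieval Run (wider cutoffs)",
    test_run_type=TestRunType.INFORMATION_RETRIEVAL,
    metrics_kwargs={
        "accuracy_at_k": at_k_intervals,
        "precision_recall_at_k": at_k_intervals,
        "ndcg_at_k": at_k_intervals,
    },
)
print(test_run.app_link)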
You can see additional retrieval examples here:
- Retrieval Evaluation Example: Basic retrieval evaluation notebook
- Cohere & Pinecone Retrieval Example: Retrieval evaluation using Cohere embeddings and Pinecone vector store
- Embedding Model Comparison: Compare different embedding models for retrieval tasks
Built-In Checks and Custom Checks
What are Checks? In Okareo, Checks are automatic evaluation metrics or constraints that analyze your model’s outputs. A check can score an output on some criterion or enforce that it meets certain requirements. For example, you might use checks to verify that a generated answer is factually correct, not too long, or similar enough to an expected reference answer. Each check runs after the model produces an output and returns either a numeric score or a pass/fail boolean indicating whether the output satisfies the check. This lets you evaluate or constrain model outputs programmatically as part of your test runs.
Built-in Checks and Using Checks in run_test
Okareo provides many built-in checks (metrics) that cover common evaluation criteria – from language quality (e.g. fluency, coherence) to correctness and relevance. For instance, reference_similarity is a built-in check that measures how closely the model's output matches a reference answer (useful for QA, summarization, or translation tasks with a ground-truth result). Other examples include context_consistency, does_code_compile (checks if generated code is syntactically correct), levenshtein_distance (measures edit distance to a reference), relevance/consistency (LLM-based relevance or consistency scoring), and more. You can retrieve the full list of available checks by calling okareo.get_all_checks().
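A quick way to see what is available in your project (the exact shape of the returned objects may vary by SDK version):
# List the checks available to your account/project
for check in okareo.get_all_checks():
    print(check)  # each entry describes one built-in or custom check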
To use these checks during evaluation, pass their names to the checks parameter in run_test(). The SDK will automatically calculate each specified check for every output. For example, to evaluate a generative model on a scenario with both a reference_similarity metric and a context_consistency check:
from okareo_api_client.models import TestRunType
evaluation = model_under_test.run_test(
name="My Evaluation",
scenario=my_scenario, # ScenarioSetResponse object or UUID
test_run_type=TestRunType.NL_GENERATION,
checks=['reference_similarity', 'context_consistency'] # list of check names to run
)
Each check will produce a score or pass/fail result for every test case. You can view these check results in the returned evaluation object or in the Okareo web app. For instance, the reference_similarity score might be high if the output closely matches the expected result. Using built-in checks in this way makes it easy to track your model's behavior across various dimensions without writing any custom code.
Defining Custom Checks
In addition to the off-the-shelf checks, Okareo allows you to define custom checks tailored to your specific needs. There are two primary ways to create custom checks:
- Code-based checks – you write Python code (logic, heuristics, regex, etc.) to evaluate the output deterministically.
- Model-based checks – you leverage another language model to judge the output via a prompt template (useful for subjective or complex criteria).
Both approaches use the okareo.create_or_update_check() method to register the new check with Okareo so it can be used in evaluations. Below, we walk through full examples of each method.
Code-Based Custom Checks
A CodeBasedCheck is a custom metric implemented in pure code. This is useful for straightforward or deterministic evaluations – for example, checking if the output JSON is valid, or if an answer contains a certain keyword. To create a code-based check:
- Create a Python file (e.g. my_custom_check.py) in your project (outside of a notebook environment).
- Define a class named Check in this file that inherits from okareo.checks.CodeBasedCheck.
- Implement the evaluate method in your class with the logic to score or validate outputs.
- Include any additional imports or helper functions in the same file as needed for your check logic.
The evaluate method of a CodeBasedCheck should be a @staticmethod (no self parameter) that accepts model_output, scenario_input, and scenario_result – the model's generated output, the scenario's input prompt, and the scenario's expected result (if any) – plus an optional metadata dict (covered later). The method should return either a boolean (for pass/fail checks) or a number (int or float for scoring checks). Generally, returning a boolean will treat the check as a Pass/Fail metric, while returning a numeric value will treat it as a scored metric.
Example – CodeBasedCheck: Suppose we want a custom check that passes if the model's output contains more than 10 words (a simple length check). First, create a file my_custom_check.py with the following content:
# In my_custom_check.py
from okareo.checks import CodeBasedCheck
class Check(CodeBasedCheck):
@staticmethod
def evaluate(model_output: str, scenario_input: str, scenario_result: str, metadata: dict) -> bool:
# Custom logic: check if output has more than 10 words
word_count = len(model_output.split())
return word_count > 10 # Returns True if output length > 10 words (Pass/Fail)
After defining the check class, register it with Okareo using the SDK. This will upload your custom check code to Okareo (so it can run in the evaluation service) or update it if the name already exists:
from my_custom_check import Check  # import the Check class defined above
# Register (or update) the custom code-based check with Okareo
check_long_output = okareo.create_or_update_check(
name="check_long_output", # unique name for the check
description="Check if output has more than 10 words",
check=Check() # instantiate our custom Check class
)
Here, create_or_update_check takes a name, an optional description, and an instance of our Check class. Under the hood, the SDK packages the my_custom_check.py code (including the Check.evaluate logic) and sends it to the Okareo platform. Once created, you can use "check_long_output" like any other check name in run_test(checks=[...]). For example, checks=['reference_similarity', 'check_long_output'] would apply both the built-in reference similarity metric and our custom length check to each model output, as shown below.
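Reusing the mut and scenario_set from the GenerationModel section above, the call would look like this:
# Apply the built-in reference_similarity check plus our custom length check
test_run = mut.run_test(
    scenario=scenario_set,
    name="Eval with custom length check",
    api_key=os.environ["OPENAI_API_KEY"],
    test_run_type=TestRunType.NL_GENERATION,
    checks=["reference_similarity", "check_long_output"],
)
print(test_run.app_link)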
Note: Ensure that your custom check file only contains one Check class and that all necessary helper code is included in that file. The Okareo service will execute your evaluate function in a sandboxed environment using the provided code.
Model-Based Custom Checks
A ModelBasedCheck uses a prompt and a scoring LLM to evaluate the output. This is handy for complex or subjective evaluations – e.g. “Is the answer polite?”, “Does the output correctly answer the question?” – where you can’t easily write a deterministic rule. Instead, you provide a prompt template that a judge model will complete to produce a score or verdict.
When creating a ModelBasedCheck, you specify a prompt_template and a check_type indicating the expected output format of the evaluation. The prompt template is a string that can include placeholders for the input, result, and generation:
- {model_input} -> corresponds to the model's input
- {generation} -> will be replaced with the model's output (the generation to be evaluated)
- {scenario_input} -> will be replaced with the scenario's input prompt
- {scenario_result} -> will be replaced with the scenario's expected result (if provided; often used for reference-based checks)
Your template should incorporate one or more of these placeholders and instruct the evaluating model to produce the desired judgment. The check_type tells Okareo what kind of response to expect from the prompt:
- CheckOutputType.SCORE – the prompt will yield a numeric score (e.g. a rating like 1–5 or a percentage). The evaluation model should output only a number in its response.
- CheckOutputType.PASS_FAIL – the prompt will yield a boolean outcome (True/False or Pass/Fail). The model should output a clear true/false result (not case-sensitive).
Example – ModelBasedCheck: Suppose we want to measure how many words the output contains, but instead of coding it directly, we’ll use an LLM to count. (This is a contrived example – you’d normally do this with a CodeBasedCheck, but it illustrates the pattern.) We’ll ask the LLM to output just the number of words in the given text. We define the check as follows:
from okareo.checks import ModelBasedCheck, CheckOutputType
check_word_count = okareo.create_or_update_check(
name="check_word_count",
description="Use an LLM to count number of words in output",
check=ModelBasedCheck(
prompt_template="Only output the number of words in the following text:\n{generation}",
check_type=CheckOutputType.SCORE
)
)
In this call, we pass an instance of ModelBasedCheck to create_or_update_check:
- The prompt_template string instructs the evaluator model: it will be filled with the model's output ({generation}) and asks for the number of words. We do not include {scenario_input} or {scenario_result} here because this check only cares about the generated text itself.
- We set check_type=CheckOutputType.SCORE because we expect a numeric answer (the count). If instead we wanted a yes/no style evaluation (e.g. “Does the answer contain the keyword ‘apple’?”), we would use CheckOutputType.PASS_FAIL and craft the prompt to elicit a True/False reply (see the sketch below).
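Here is what that pass/fail variant could look like, following the same pattern (the check name and prompt wording are illustrative):
# Hypothetical pass/fail check: does the output mention the keyword "apple"?
check_contains_apple = okareo.create_or_update_check(
    name="check_contains_apple",
    description="LLM judge: does the output mention the word 'apple'?",
    check=ModelBasedCheck(
        prompt_template=(
            "Answer True if the following text mentions the word 'apple', "
            "otherwise answer False. Only output True or False:\n{generation}"
        ),
        check_type=CheckOutputType.PASS_FAIL,
    ),
)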
When this check runs as part of a test, Okareo will internally call an evaluator model with the given prompt template. The placeholders are substituted with the actual scenario data for each test case. The LLM’s response (just a number in this example) is parsed and returned as the check’s result. You can then see this metric like any other check in the evaluation results. Model-based checks essentially let you plug in an LLM “judge” to grade your model’s outputs on arbitrary criteria, without writing custom code.
Using Metadata in Custom Checks
Custom checks can also access additional information through a metadata dictionary, passed directly into your evaluate method. This metadata can include details like the model's latency (milliseconds), tool calls, token usage, or other runtime information.
Here's a simplified example:
from okareo.checks import CodeBasedCheck
class Check(CodeBasedCheck):
@staticmethod
def evaluate(model_output: str, scenario_input: str, scenario_result: str, metadata: dict) -> bool:
# Check if latency is under 1000 ms
return metadata.get('latency', float('inf')) < 1000
Example of metadata:
# Example metadata passed into an evaluate method
{
"latency": 342, # The model took 342 milliseconds to generate the output
"tool_calls": [], # No external tools were called by the model (empty list)
"input_tokens": 32 # Token counts (if applicable)
}
You could, for example, incorporate the number of tool invocations into your metric (perhaps penalize outputs that required too many tool calls), or enforce performance thresholds as shown.
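For instance, here is a minimal sketch of a check that combines tool usage and latency, using the same evaluate signature as above (the thresholds are arbitrary examples):
# In a separate file, e.g. my_efficiency_check.py
from okareo.checks import CodeBasedCheck

class Check(CodeBasedCheck):
    @staticmethod
    def evaluate(model_output: str, scenario_input: str, scenario_result: str, metadata: dict) -> bool:
        # Pass only if the model stayed fast and did not lean too heavily on tools
        tool_calls = metadata.get("tool_calls") or []
        latency_ms = metadata.get("latency", float("inf"))
        return len(tool_calls) <= 2 and latency_ms < 2000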
With custom checks (whether code-based or model-based) created and registered, you can include them by name in run_test(checks=[...]) just like built-in checks. This makes it easy to extend Okareo's evaluation suite with your own domain-specific metrics and have them evaluated consistently across all your LLM tests. The Okareo platform will handle executing your check logic for each output and aggregating the results, so you can focus on interpreting the metrics and improving your LLM app.
Runtime Error Tracking and Evaluations
In addition to running structured evaluations, the Okareo SDK lets you log data from your running application for error tracking or feedback loops. If you have an app in production (or in a notebook) and want to capture each prompt/response pair, you can use the ModelUnderTest.add_data_point() method. This will send an input-output record to Okareo (associated with the model/agent you registered) so that Okareo can store it and evaluate it with Checks in the background.
For example, use add_data_point() to synchronously log a model interaction, and add_data_point_async() to send it without blocking:
# Assume `mut` is a ModelUnderTest obtained after model registration
# Log a successful prediction data point
mut.add_data_point(
input_obj="Hello, world",
result_obj="Hi there!"
) # Records an input/result pair
# Asynchronous logging (non-blocking)
mut.add_data_point_async(
input_obj="test input",
error_message="Quota exceeded",
error_code="429"
) # Queues the data point to be sent in background
Each data point can include an input_obj and result_obj (each can be a dict, list, or str), plus optional context like error_message and error_code. The asynchronous variant accepts the same fields and returns immediately (the SDK handles sending in the background). These runtime logs help track errors or evaluate model outputs in production.
Multi-Turn (Dialog) Evaluations
Okareo supports running full conversational evaluations via the MultiTurnDriver construct. This means you can evaluate not just single question-answer pairs, but interactive conversations and even agent-to-agent dialogues. This is a powerful capability that goes beyond typical one-shot prompts – you can simulate a workflow of interactions. If your system has multiple agents that talk to each other, you could create a scenario where their messages are passed through Okareo's driver/target mechanism for evaluation.
You can configure a MultiTurnDriver as part of your model definition to handle this. For instance, you might register a model with a MultiTurnDriver that wraps your target model (the agent) and uses either an LLM or custom logic to simulate the user's side of the conversation. The driver can be configured with parameters like number of turns, who speaks first, and a StopConfig to decide when to end the dialog. Once such an agent is registered (with type="driver" under the hood), calling run_test will execute the conversation workflow. This allows you to test agent behavior over multiple interactions, not just single question-answer pairs, which is essential for debugging agents that use memory, tools, or multi-step workflows.
Concept
A MultiTurnDriver is composed of:
- Driver: Simulates a user or external agent.
- Target: The model/agent under evaluation.
They alternate turns in a simulated conversation, which is run via run_test.
Setup
from okareo.model_under_test import MultiTurnDriver, GenerationModel, StopConfig
driver = MultiTurnDriver(
driver_temperature=1.0,
max_turns=3,
repeats=1,
first_turn="target",
target=GenerationModel( # or CustomMultiturnTarget
model_id="gpt-4o",
temperature=0.7,
system_prompt_template="You are a customer service agent for WebBizz...",
tools=[
{
"type": "function",
"function": {
"name": "delete_account",
"description": "Deletes the user's account",
# parameter schema here
},
}
] # optional
),
stop_check=StopConfig(
check_name="task_completed",
stop_on=True
)
)
Driver Parameters
| Parameter | Description |
|---|---|
| driver_temperature | Controls randomness of the simulated user/agent |
| max_turns | Maximum number of back-and-forth messages |
| repeats | Number of times each test row is repeated to capture variance |
| first_turn | Whether the "driver" or "target" starts the conversation |
| stop_check | Defines the stopping condition (via a check) |
Target
The Target is either:
- A GenerationModel
- A CustomMultiturnTarget (custom logic that receives the message history as input)
Note: You must define a system_prompt_template when using a MultiTurnDriver.
Creating a Scenario for Driver
Use SeedData to define different types of Driver behavior:
from okareo_api_client.models import ScenarioSetCreate, SeedData
seeds = [
SeedData(
input_="You are interacting with a customer service agent. First, ask about WebBizz...",
result="N/A",
)
]
scenario_set_create = ScenarioSetCreate(
name=f"MultiTurn Conversation",
seed_data=seeds
)
scenario = okareo.create_scenario_set(scenario_set_create)
The input_ is passed to the Driver and should instruct it how to behave during the test.
Function Calling + Tool Mocking
Your model can call tools using the OpenAI tool calling format:
tools = [
{
"type": "function",
"function": {
"name": "delete_account",
"description": "Deletes the user's account",
# parameters
},
}
]
You can instruct the Driver to mock tool results via the Driver prompt:
If you receive any function calls, respond as if the account was deleted successfully in JSON.
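For example, that instruction can be folded into the Driver scenario's input_ (reusing the SeedData pattern shown earlier):
seeds = [
    SeedData(
        input_=(
            "You are interacting with a customer service agent. Ask the agent to delete your account. "
            "If you receive any function calls, respond as if the account was deleted successfully in JSON."
        ),
        result="N/A",
    )
]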
Conversation Control & Checks
First create the check with:
from okareo.checks import CheckOutputType, ModelBasedCheck
okareo.create_or_update_check(
name="task_completion_delete_account",
description="Check if the agent confirms account deletion",
check=ModelBasedCheck( # or any Check subclass
prompt_template=(
"The task is complete if the output confirms that account deletion was successful. "
"Return True for if the task is completed, False otherwise. "
"Here is the output to check: {model_output}"
),
check_type=CheckOutputType.PASS_FAIL,
),
)
Use StopConfig to end the test when a condition is met:
stop_check=StopConfig(
check_name="task_completion_delete_account",
stop_on=True
)
For multi-turn evaluations, Checks can assess:
- Task completion
- Tone
- Accuracy
- Guideline adherence
- etc.
Once the Driver, Target, and Driver scenario are ready, you can register and run your multi-turn evaluation:
from okareo_api_client.models.test_run_type import TestRunType
multiturn_model = okareo.register_model(
name="Demo MultiTurnDriver",
model=driver,
update=True,
)
evaluation = multiturn_model.run_test(
scenario=scenario,
name="Multi-turn Demo Evaluation",
api_key=os.environ["OPENAI_API_KEY"],
test_run_type=TestRunType.NL_GENERATION
)
print(f"See results in Okareo app: {evaluation.app_link}")
End-to-End Examples: You can find end-to-end notebook examples with tools and CustomMultiturnTarget in this directory.
Scenario Generation
Okareo can generate new scenarios (variations of your test data) using LLMs. The SDK provides okareo.generate_scenarios(source_scenario, name, number_examples, generation_type), which leverages an LLM to rephrase or create new input variations from an existing scenario set. For example, using generation_type=ScenarioType.REPHRASE_INVARIANT will create paraphrased inputs that should yield the same expected result. This is useful for robustness testing – you can take a few seed Q&A pairs and expand them into dozens of semantically similar but varied questions automatically.
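A minimal sketch, assuming ScenarioType is importable from okareo_api_client.models and that source_scenario accepts the scenario ID of the set created earlier:
from okareo_api_client.models import ScenarioType  # assumption: import path may differ by SDK version

# Generate paraphrased variations of an existing scenario set
generated = okareo.generate_scenarios(
    source_scenario=scenario_set.scenario_id,  # assumption: the scenario ID of an existing set
    name="Basic Q&A - Rephrased",
    number_examples=5,
    generation_type=ScenarioType.REPHRASE_INVARIANT,
)
print(generated.app_link)  # assumption: the generated set also exposes an app_link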