okareo.model_under_test

ModelUnderTest Objects

class ModelUnderTest(AsyncProcessorMixin)

A class for managing a Model Under Test (MUT) in Okareo. Returned by okareo.register_model()

submit_test

def submit_test(
        scenario: Union[ScenarioSetResponse, str, UUID],
        name: str,
        api_key: Optional[str] = None,
        api_keys: Optional[dict] = None,
        metrics_kwargs: Optional[dict] = None,
        test_run_type: TestRunType = TestRunType.MULTI_CLASS_CLASSIFICATION,
        calculate_metrics: bool = True,
        checks: Optional[List[str]] = None,
        simulation_params: Optional[Any] = None,
        driver_id: Optional[str] = None) -> TestRunItem

Asynchronous server-based version of test-run execution. For CustomModels, model invocations are handled client-side in a background thread then evaluated server-side asynchronously. For other models, model invocations and evaluation are both handled server-side asynchronously.

Arguments:

scenario Union[ScenarioSetResponse, str] - The scenario set or identifier to use for the test run.
name str - The name to assign to the test run.
api_key Optional[str] - Optional API key for authentication.
api_keys Optional[dict] - Optional dictionary of API keys for different services.
metrics_kwargs Optional[dict] - Optional dictionary of keyword arguments for metrics calculation.
test_run_type TestRunType - The type of test run to execute. Defaults to MULTI_CLASS_CLASSIFICATION.
calculate_metrics bool - Whether to calculate metrics after the test run. Defaults to True.
checks Optional[List[str]] - Optional list of checks to perform during the test run.

Returns:

TestRunItem - The resulting test run item for the submitted test run. The id field can be used to retrieve the test run.

run_test

def run_test(
        scenario: Union[ScenarioSetResponse, str, UUID],
        name: str,
        api_key: Optional[str] = None,
        api_keys: Optional[dict] = None,
        metrics_kwargs: Optional[dict] = None,
        test_run_type: TestRunType = TestRunType.MULTI_CLASS_CLASSIFICATION,
        calculate_metrics: bool = True,
        checks: Optional[List[str]] = None,
        simulation_params: Optional[Any] = None,
        driver_id: Optional[str] = None) -> TestRunItem

Server-based version of test-run execution. For CustomModels, model invocations are handled client-side then evaluated server-side. For other models, model invocations and evaluations handled server-side.

Arguments:

scenario Union[ScenarioSetResponse, str] - The scenario set or identifier to use for the test run.
name str - The name to assign to the test run.
api_key Optional[str] - Optional API key for authentication.
api_keys Optional[dict] - Optional dictionary of API keys for different services.
metrics_kwargs Optional[dict] - Optional dictionary of keyword arguments for metrics calculation.
test_run_type TestRunType - The type of test run to execute. Defaults to MULTI_CLASS_CLASSIFICATION.
calculate_metrics bool - Whether to calculate metrics after the test run. Defaults to True.
checks Optional[List[str]] - Optional list of checks to perform during the test run.

Returns:

TestRunItem - The resulting test run item for the completed test run.

get_test_run

def get_test_run(test_run_id: Union[str, UUID]) -> TestRunItem

Retrieve a test run by its ID.

Arguments:

test_run_id str - The ID of the test run to retrieve.

Returns:

TestRunItem - The test run item corresponding to the provided ID.

ModelInvocation Objects

@_attrs_define
class ModelInvocation()

Model invocation response object returned from a CustomModel.invoke method or as an element of a list returned from a CustomBatchModel.invoke_batch method.

Arguments:

model_prediction - Prediction from the model to be used when running the evaluation, e.g. predicted class from classification model or generated text completion from a generative model. This would typically be parsed out of the overall model_output_metadata.
model_input - All the input sent to the model.
model_output_metadata - Full model response, including any metadata returned with model's output.
tool_calls - List of tool calls made during the model invocation, if any.

OpenAIModel Objects

@define
class OpenAIModel(BaseModel)

An OpenAI model definition with prompt template and relevant parameters for an Okareo evaluation.

Arguments:

model_id - Model ID to request from OpenAI completion. For list of available models, see https://platform.openai.com/docs/models
temperature - Parameter for controlling the randomness of the model's output.
system_prompt_template - System role prompt template to pass to the model. Uses mustache syntax for variable substitution, e.g. {scenario_input}.
user_prompt_template - User role prompt template to pass to the model. Uses mustache syntax for variable substitution, e.g. {scenario_input}
dialog_template - Dialog template in OpenAI message format to pass to the model. Uses mustache syntax for variable substitution.
tools - List of tools to pass to the model.

GenerationModel Objects

@define
class GenerationModel(BaseModel)

An LLM definition with prompt template and relevant parameters for an Okareo evaluation.

Arguments:

model_id - Model ID to request for LLM completion.
temperature - Parameter for controlling the randomness of the model's output.
system_prompt_template - System role prompt template to pass to the model. Uses mustache syntax for variable substitution, e.g. {scenario_input}.
user_prompt_template - User role prompt template to pass to the model. Uses mustache syntax for variable substitution, e.g. {scenario_input}
dialog_template - Dialog template in OpenAI message format to pass to the model. Uses mustache syntax for variable substitution.
tools - List of tools to pass to the model.

OpenAIAssistantModel Objects

@_attrs_define
class OpenAIAssistantModel(BaseModel)

An OpenAI Assistant definition with prompt template and relevant parameters for an Okareo evaluation.

Arguments:

model_id - Assistant ID to request to run a thread against.
assistant_prompt_template - System role prompt template to pass to the model. Uses mustache syntax for variable substitution, e.g. {scenario_input}.
user_prompt_template - User role prompt template to pass to the model. Uses mustache syntax for variable substitution, e.g. {scenario_input}
dialog_template - Dialog template in OpenAI message format to pass to the model. Uses mustache syntax for variable substitution.

CohereModel Objects

@_attrs_define
class CohereModel(BaseModel)

A Cohere model definition with prompt template and relevant parameters for an Okareo evaluation.

Arguments:

model_id - Model ID to request for the Cohere completion. For a full list of available models, see https://docs.cohere.com/v2/docs/models
model_type - Type of application for the Cohere model. Currently, we support 'classify' and 'embed'.
input_type - Input type for the Cohere embedding model. For more details, see https://docs.cohere.com/v2/docs/embeddings#the-input_type-parameter

PineconeDb Objects

@_attrs_define
class PineconeDb(BaseModel)

A Pinecone vector database configuration for use in an Okareo retrieval evaluation.

Arguments:

index_name - The name of the Pinecone index to connect to.
region - The region where the Pinecone index is hosted.
project_id - The project identifier associated with the Pinecone index.
top_k - The number of top results to retrieve for queries. Defaults to 5.

QdrantDB Objects

@_attrs_define
class QdrantDB(BaseModel)

A Qdrant vector database configuration for use in an Okareo retrieval evaluation.

Arguments:

collection_name - The name of the Qdrant collection to connect to.
url - The URL of the Qdrant instance.
top_k - The number of top results to retrieve for queries. Defaults to 5.
sparse - Whether to use sparse vectors for the Qdrant collection. Defaults to False.

CustomModel Objects

@_attrs_define
class CustomModel(BaseModel)

A custom model definition for an Okareo evaluation. Requires a valid invoke definition that operates on a single input.

Arguments:

name - A name for the custom model.

invoke

@abstractmethod
def invoke(input_value: Union[dict, list, str]) -> Union[ModelInvocation, Any]

Method for taking a single scenario input and returning a single model output

Arguments:

input_value - Union[dict, list, str] - input to the model.

Returns:

Union[ModelInvocation, Any] - model output. If the model returns a ModelInvocation, it should contain the model's prediction, input, and metadata. If the model returns a tuple, the first element should be the model's prediction and the second element should be the metadata.

CustomMultiturnTarget Objects

@_attrs_define
class CustomMultiturnTarget(BaseModel)

A custom model definition for an Okareo multiturn evaluation. Requires a valid invoke definition that operates on a single turn of a converstation.

start_session

def start_session(
    scenario_input: str | None = None
) -> tuple[str | None, ModelInvocation | None]

Method for starting a multiturn conversation with a custom model

Returns:

str | None: session_id - the ID of the session started by the model.
ModelInvocation | None: model output - the model's response to the session start, if any.

end_session

def end_session(session_id: str) -> None

Method for ending a multiturn conversation with a custom model

Arguments:

session_id - str - the ID of the session to end.

invoke

@abstractmethod
def invoke(messages: List[dict[str, str]],
           scenario_input: Optional[Union[dict, list, str]] = None,
           session_id: Optional[str] = None) -> Union[ModelInvocation, Any]

Method for continuing a multiturn conversation with a custom model

Arguments:

messages - list - list of messages in the conversation
scenario_input - Optional[dict | list | str] - scenario input for the conversation

Returns:

CustomMultiturnTargetAsync Objects

@_attrs_define
class CustomMultiturnTargetAsync(BaseModel)

A custom model definition for an Okareo multiturn evaluation that uses asynchronous methods. Requires a valid invoke definition that operates on a single turn of a converstation.

start_session

async def start_session(
    scenario_input: str | None = None
) -> tuple[str | None, ModelInvocation | None]

Method for starting a multiturn conversation with a custom model

Returns:

str | None: session_id - the ID of the session started by the model.
ModelInvocation | None: model output - the model's response to the session start, if any.

end_session

async def end_session(session_id: str) -> None

Method for ending a multiturn conversation with a custom model

Arguments:

session_id - str - the ID of the session to end.

invoke

@abstractmethod
async def invoke(
    messages: List[dict[str, str]],
    scenario_input: Optional[Union[dict, list, str]] = None,
    session_id: Optional[str] = None
) -> Awaitable[Union[ModelInvocation, Any]]

Method for continuing a multiturn conversation with a custom model

Arguments:

messages - list - list of messages in the conversation
scenario_input - Optional[dict | list | str] - scenario input for the conversation

Returns:

VoiceTarget Objects

class VoiceTarget(BaseModel)

Base class for realtime voice targets in Okareo multiturn evaluation.

This target runs server-side during multiturn evaluations and follows the same API key pattern as other targets: API keys are passed via the api_keys parameter in run_simulation().

OpenAIVoiceTarget Objects

@_attrs_define
class OpenAIVoiceTarget(VoiceTarget)

OpenAI Realtime API voice target for Okareo multiturn evaluation.

Arguments:

model - Model ID for OpenAI Realtime. Default: "gpt-realtime".
instructions - System instructions for the voice agent. Default: "Be brief and helpful."
output_voice - Voice ID for TTS output. Options: "alloy", "echo", "fable", "onyx", "nova", "shimmer".

DeepgramVoiceTarget Objects

@_attrs_define
class DeepgramVoiceTarget(VoiceTarget)

Deepgram voice target for Okareo multiturn evaluation.

Arguments:

model - Model ID for Deepgram. Default: "aura-2".
instructions - System instructions for the voice agent. Default: "Be brief and helpful."
output_voice - Voice ID for TTS output. Example: "aura-2-thalia-en".

TwilioVoiceTarget Objects

@_attrs_define
class TwilioVoiceTarget(VoiceTarget)

Twilio voice target for Okareo multiturn evaluation.

Arguments:

account_sid - Twilio account SID for authentication.
auth_token - Twilio authentication token.
from_phone_number - Phone number to call from (Twilio number).
to_phone_number - Phone number to call to (destination number).
max_parallel_requests - Maximum number of parallel requests the target can handle.

StopConfig Objects

@define
class StopConfig()

Configuration for stopping a multiturn conversation based on a specific check.

Arguments:

check_name - Name of the check to use for stopping the conversation.
stop_on - The check condition to stop the conversation. Defaults to True (i.e., conversation stops when check evaluates to True).

SessionConfig Objects

class SessionConfig()

Configuration for a custom API endpoint that starts a session.

Arguments:

url - URL of the endpoint to start the session.
method - HTTP method to use for the request. Defaults to POST.
headers - Headers to include in the request. Defaults to an empty JSON object.
body - Body to include in the request. Defaults to an empty JSON object.
status_code - Expected HTTP status code of the response.
response_session_id_path - Path to extract the session ID from the response. E.g., response.id will use the id field of the response JSON object to set the session_id.

TurnConfig Objects

class TurnConfig()

Configuration for a custom API endpoint that continues a session/conversation by one turn.

Arguments:

url - URL of the endpoint to start the session.
method - HTTP method to use for the request. Defaults to POST.
headers - Headers to include in the request. Supports mustache syntax for variable substitution for {latest_message}, {message_history}, {session_id}. Defaults to an empty JSON object.
body - Body to include in the request. Supports mustache syntax for variable substitution for {latest_message}, {message_history}, {session_id}. Defaults to an empty JSON object.
method1 - Expected HTTP status code of the response.
method2 - Path to extract the model's generated message from the response. E.g., method3 will parse out the corresponding field of the response JSON object as the model's generated response.
method4 - Path to extract tool calls from the response.

EndSessionConfig Objects

class EndSessionConfig()

Configuration for a custom API endpoint that ends a session.

Arguments:

url - URL of the endpoint to start the session.
method - HTTP method to use for the request. Defaults to POST.
headers - Headers to include in the request. Defaults to an empty JSON object.
body - Body to include in the request. Defaults to an empty JSON object.
status_code - Expected HTTP status code of the response.
response_session_id_path - Path to extract the session ID from the response.

CustomEndpointTarget Objects

class CustomEndpointTarget(BaseModel)

A trio of custom API endpoints for starting a session and continuing a conversation to use in Okareo multiturn evaluation.

Arguments:

start_session - A valid SessionConfig for starting a session.
next_turn - A valid TurnConfig for requesting and parsing the next turn of a conversation.
end_session - A valid EndSessionConfig for ending a session.
max_parallel_requests - Maximum number of parallel requests to allow when running the evaluation.

MultiTurnDriver Objects

@_attrs_define
class MultiTurnDriver(BaseModel)

A driver model for Okareo multiturn evaluation.

Arguments:

target - Target model under test to use in the multiturn evaluation.
stop_check - A valid StopConfig or a dict that can be converted to StopConfig.
driver_model_id - Model ID to use for the driver model (e.g., "gpt-4.1").
driver_temperature - Parameter for controlling the randomness of the driver model's output.
repeats - Number of times to run a conversation per scenario row. Defaults to 1.
max_turns - Maximum number of turns to run in a conversation. Defaults to 5.
first_turn - Name of model (i.e., "target" or "driver") that should initiate each conversation. Defaults to "target".
driver_prompt_template - Optional system prompt template to pass to the driver model. Uses mustache syntax for variable substitution, e.g. {input}.

CustomBatchModel Objects

@_attrs_define
class CustomBatchModel(BaseModel)

A custom batch model definition for an Okareo evaluation. Requires a valid invoke_batch definition that operates on a single input.

invoke_batch

@abstractmethod
def invoke_batch(
    input_batch: list[dict[str, Union[dict, list, str]]]
) -> list[dict[str, Union[ModelInvocation, Any]]]

Method for taking a batch of scenario inputs and returning a corresponding batch of model outputs

Arguments:

input_batch - list[dict[str, Union[dict, list, str]]] - batch of inputs to the model. Expects a list of dicts of the format { 'id': str, 'input_value': Union[dict, list, str] }.

Returns:

List of dicts of format { 'id': str, 'model_invocation': Union[ModelInvocation, Any] }. 'id' must match the corresponding input_batch element's 'id'.

ModelUnderTest Objects​

submit_test​

run_test​

get_test_run​

ModelInvocation Objects​

OpenAIModel Objects​

GenerationModel Objects​

OpenAIAssistantModel Objects​

CohereModel Objects​

PineconeDb Objects​

QdrantDB Objects​

CustomModel Objects​

invoke​

CustomMultiturnTarget Objects​

start_session​

end_session​

invoke​

CustomMultiturnTargetAsync Objects​

start_session​

end_session​

invoke​

VoiceTarget Objects​

OpenAIVoiceTarget Objects​

DeepgramVoiceTarget Objects​

TwilioVoiceTarget Objects​

StopConfig Objects​

SessionConfig Objects​

TurnConfig Objects​

EndSessionConfig Objects​

CustomEndpointTarget Objects​

MultiTurnDriver Objects​

CustomBatchModel Objects​

invoke_batch​

ModelUnderTest Objects

submit_test

run_test

get_test_run

ModelInvocation Objects

OpenAIModel Objects

GenerationModel Objects

OpenAIAssistantModel Objects

CohereModel Objects

PineconeDb Objects

QdrantDB Objects

CustomModel Objects

invoke

CustomMultiturnTarget Objects

start_session

end_session

invoke

CustomMultiturnTargetAsync Objects

start_session

end_session

invoke

VoiceTarget Objects

OpenAIVoiceTarget Objects

DeepgramVoiceTarget Objects

TwilioVoiceTarget Objects

StopConfig Objects

SessionConfig Objects

TurnConfig Objects

EndSessionConfig Objects

CustomEndpointTarget Objects

MultiTurnDriver Objects

CustomBatchModel Objects

invoke_batch