
Model Checks

A Model Check uses a prompt template to evaluate the generated response. At runtime, a judge LLM reads your prompt -- with template variables replaced by actual evaluation data -- and returns a score or pass/fail result. Model checks are useful when evaluation requires understanding, reasoning, or subjective judgment that cannot be captured in deterministic code.

Okareo provides many predefined model checks out of the box. You can also write your own custom model checks.


Custom Model Checks

To create a custom model check, write a prompt template -- a set of instructions for a judge LLM -- and register it with Okareo.

Here's an example using the Python SDK:

```python
from okareo import Okareo
from okareo.checks import ModelBasedCheck  # check classes live in the SDK's checks module
from okareo_api_client.models import CheckOutputType

okareo = Okareo("<OKAREO_API_KEY>")

check = okareo.create_or_update_check(
    name="adherence_check",
    description="Check if model output follows the instructions in the context",
    check=ModelBasedCheck(
        prompt_template="""You will be given a Context and a Model Output.
Your task is to rate the Model Output on one metric: Adherence.

Evaluation Criteria:
Adherence (0 or 1) - whether the Model Output follows the instructions in the Context.

Evaluation Steps:
1. Read the Context and identify the instructions.
2. Read the Model Output and check if it follows those instructions.
3. Assign a score: 1 if adherent, 0 if not.

Context:
{scenario_result}

Model Output:
{generation}

Adherence (0 or 1 in double brackets, explanation in double parentheses):""",
        check_type=CheckOutputType.PASS_FAIL,
    ),
)
```

In this example:

  • name: A unique identifier for the check.
  • description: A brief description of what the check does.
  • prompt_template: The prompt template with template variables (e.g. {generation}, {scenario_result}) that the runtime replaces with actual data. This prompt instructs a judge LLM on how to evaluate.
  • check_type: The output type — CheckOutputType.PASS_FAIL (0 or 1) or CheckOutputType.SCORE (numeric scale, e.g. 1–5).
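
Once registered, the check can be referenced by name in an evaluation. The following is a minimal sketch that assumes a registered model (model_under_test) and a scenario already exist; it follows the run_test pattern shown later on this page, and the evaluation name is illustrative.

```python
from okareo_api_client.models.test_run_type import TestRunType

# Run an evaluation that applies the custom check by name.
# Assumes "model_under_test" and "scenario" were created elsewhere.
evaluation = model_under_test.run_test(
    name="Adherence Evaluation",
    scenario=scenario,
    test_run_type=TestRunType.NL_GENERATION,
    calculate_metrics=True,
    checks=["adherence_check"],  # the name passed to create_or_update_check
)
```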

Template Variables

Your prompt must include at least one template variable where the runtime injects real data. Without one, the judge has nothing to evaluate.

| Variable | What the runtime injects |
| --- | --- |
| {generation} or {model_output} | The model output being evaluated. |
| {input} or {scenario_input} | The scenario/source text. |
| {result} or {scenario_result} | The reference/expected output. |
| {model_input} | What was sent to the model (prompt or messages). |
| {message_history} | Full conversation (model_input messages + assistant model_output). |
| {tool_calls} | Model's tool/function calls from the result. |
| {tools} | Tool definitions/schema from input. |
| {simulation_message_history} | Simulation message history from metadata. |
| {audio_messages} | Audio input messages. |
| {audio_output} | Audio model output. |

Use the template variables that fit your check:

  • Summarization (compare output to source): use {input} (or {scenario_input}) and {generation}.
  • Adherence (follow instructions/context): use {result} (or {scenario_result}) and {generation}.
  • Task completion (full exchange): use {model_input} and {generation}.
  • Function calling: use {model_input}, {generation}, {tool_calls}, and {tools}.
Warning: Do not invent new template variable names (e.g. {request}, {history}, {final_message}). Only the variables listed above are substituted at runtime; any other {name} will be left as-is in the prompt and may confuse the judge.

Writing a Good Prompt

A well-structured judge prompt typically includes:

  1. Task / context — What the judge is given and what to rate (one metric).
  2. Evaluation criteria — Metric name, scale (e.g. 0/1 or 1–5), and a short definition.
  3. Evaluation steps — Numbered instructions for the judge to follow.
  4. Examples — At least one input/output example showing the expected format.
  5. Input blocks — Labeled sections with template variables (e.g. "Source Text: {scenario_input}").
  6. Output format — How the judge should respond.

Output format matters. For pass/fail checks, instruct the judge to respond with 0 or 1 in double brackets (e.g. [[1]]). For score checks, use a numeric scale in double brackets (e.g. [[4]]). Optionally ask for an explanation in double parentheses (e.g. ((The summary is coherent.))).
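
Putting these pieces together, here is an illustrative sketch of a score-based check that follows the structure above. It is not one of Okareo's predefined checks; the metric (Clarity), its wording, the check name, and the import paths are assumptions to adapt to your own task.

```python
from okareo import Okareo
from okareo.checks import ModelBasedCheck
from okareo_api_client.models import CheckOutputType

okareo = Okareo("<OKAREO_API_KEY>")

# Illustrative judge prompt for a 1-5 score check. Template variables:
# {scenario_input} is the source text, {generation} is the model output.
clarity_prompt = """You will be given a Source Text and a Model Output.
Your task is to rate the Model Output on one metric: Clarity.

Evaluation Criteria:
Clarity (1-5) - how clearly the Model Output communicates the key points of the Source Text. 1 is very unclear, 5 is very clear.

Evaluation Steps:
1. Read the Source Text and identify its key points.
2. Read the Model Output and judge how clearly those points are communicated.
3. Assign a score from 1 to 5.

Example:
Source Text: <a short article>
Model Output: <a clear restatement of its key points>
Clarity (1-5 in double brackets, explanation in double parentheses):
[[5]] ((The output states every key point in plain language.))

Source Text:
{scenario_input}

Model Output:
{generation}

Clarity (1-5 in double brackets, explanation in double parentheses):"""

check = okareo.create_or_update_check(
    name="clarity_check",  # assumed name for this illustration
    description="Rate how clearly the output communicates the source's key points",
    check=ModelBasedCheck(
        prompt_template=clarity_prompt,
        check_type=CheckOutputType.SCORE,
    ),
)
```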


Predefined Model Checks

Okareo provides the following predefined model checks. You can list all available checks in the SDK:

```javascript
okareo.get_all_checks()
```
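
In the Python SDK, the equivalent call is shown in the sketch below; the exact shape of the returned check objects can vary by SDK version, so the name attribute access is an assumption.

```python
# List every check (predefined and custom) available to your project.
all_checks = okareo.get_all_checks()
for check in all_checks:
    # Assumption: each returned entry exposes a name attribute; inspect the
    # objects returned by your SDK version if this differs.
    print(check.name)
```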

To use any of these checks, specify them by name when running an evaluation:

TypeScript:

```typescript
const checks = ['check_name_1', 'check_name_2', ..., 'check_name_N'];

// assume that "scenario" is a ScenarioSetResponse object or a UUID
const eval_results: any = await model.run_test({
    model_api_key: OPENAI_API_KEY,
    name: 'Evaluation Name',
    tags: ["Example", `Build:${UNIQUE_BUILD_ID}`],
    project_id: project_id,
    scenario_id: scenario_id,
    calculate_metrics: true,
    type: TestRunType.NL_GENERATION,
    checks: checks,
} as RunTestProps);
```

Python:

```python
checks = ['check_name_1', 'check_name_2', ..., 'check_name_N']

# assume that "scenario" is a ScenarioSetResponse object or a UUID
evaluation = model_under_test.run_test(
    name="Evaluation Name",
    scenario=scenario,
    project_id=project_id,
    test_run_type=TestRunType.NL_GENERATION,
    calculate_metrics=True,
    checks=checks,
)
```

Agent Behavioral Checks

The following checks assess agentic behaviors and instruction-following capabilities.

Behavior Adherence

Name: behavior_adherence

A measurement of how well the model output follows instructions, directives, or commands outlined in the scenario result. This check works best when the scenario result contains a description of expected model behaviors. Focuses on adherence to instructions, not factual accuracy. Output: pass/fail (0 or 1).

Task Completed

Name: task_completed

A pass/fail check on the model's generated output compared to a list of messages provided in the model input. Output is 1 if the model output fulfills the task specified by the latest "user" message in the message list. Otherwise, output is 0.

Result Completed

Name: result_completed

A pass/fail check that compares the conversation message history in the model input/model output against the expected outcome described in the scenario result. Output is 1 if the model output fulfills the outcome specified by the scenario result. Otherwise, output is 0.

Model Refusal

Name: model_refusal

A measurement of whether the model output refuses to respond to the user's question. Output: pass/fail (0 or 1).

Loop Guard

Name: loop_guard

Evaluates whether conversations between AI agents are getting stuck in repetitive patterns (loops) without making meaningful progress. Output: pass/fail (0 or 1).

Automated Resolution

Name: automated_resolution

Evaluates whether the agent escalated the conversation to a third party or resolved it autonomously. Output: pass/fail (0 or 1).

Response Consistency

Name: response_consistency

Checks each assistant message against the entire live dialog state for consistency. Detects contradictions or information drift across turns. Output: pass/fail (0 or 1).

Response Loop

Name: response_loop

Scans each assistant turn against the dialog to detect repeated content with no new information or action. Output: pass/fail (0 or 1).

Simulation Trace Consistency

Name: simulation_trace_consistency

Compares trace message history captured via OpenTelemetry with the originating simulation dialog to ensure they are consistent. Output: pass/fail (0 or 1).

Function Call Checks

These model-based checks use a judge LLM to evaluate function call quality. For deterministic code-based function call checks (AST validation, parameter matching), see Code Checks -- Function Call Checks.

Function Call Consistency

Name: function_call_consistency

A pass/fail check on a function call compared to the model input. Works for a model input containing either a single message or a list of messages. Output is 1 if one or more function calls are present and each function call's name and parameters are consistent with the model input. Otherwise, output is 0.

Function Call Validator

Name: function_call_validator

Evaluates whether the correct tool is selected and executed in the proper sequence. Output: pass/fail (0 or 1).

Function Parameter Accuracy

Name: function_parameter_accuracy

Ensures extracted values from the dialog are correct and match expectations. Output: pass/fail (0 or 1).

Function Result Present

Name: function_result_present

A pass/fail check that evaluates tool messages in a conversation for valid outputs. Examines messages with a "tool" role and verifies their outputs are not null, none, or error responses. Output: pass/fail (0 or 1).

RAG Checks

The following checks assess RAG (Retrieval-Augmented Generation) completions and conversations.

Context Relevance

Name: context_relevance

A measurement of how relevant the context contained in the model input is. Scores from 1 to 5.

Context Consistency

Name: context_consistency

A binary evaluation of whether the model output is fully consistent with the model input. Assesses if the generated output aligns with and is supported by the given input, without contradicting or misrepresenting any information provided. Focuses solely on consistency between input and output, ignoring external facts. Output: pass/fail (0 or 1).

Reference Similarity

Name: reference_similarity

A measurement of how similar the model output is to the scenario result. The model output is the generated answer, and the scenario result is the expected answer. Scores from 1 to 5.

Faithfulness

Name: faithfulness

Checks whether the generated response (model output) is factually consistent with the provided context (scenario result). Scores from 1 to 5.

Summarization Checks

The following model checks use a judge LLM to assess the quality of generated summaries. Each produces an integer score ranging from 1 to 5.

Fluency

Name: fluency_summary

A measurement of quality based on grammar, spelling, punctuation, word choice, and sentence structure. Scores from 1 to 5.

Coherence

Name: coherence_summary

A measurement of structure and organization in the model's output. A higher score indicates a well-structured and well-organized summary. Scores from 1 to 5.

Consistency

Name: consistency_summary

A measurement of factual accuracy between the model output and the model input. Useful in summarization tasks where the model's input is a target document and the output is a summary. Scores from 1 to 5.

Relevance

Name: relevance_summary

A measurement of how much important information from the scenario input is contained in the model output. Rewards highly relevant information and penalizes redundancies or irrelevant content. Scores from 1 to 5.
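
In practice, these four checks are often run together on the same evaluation. A minimal sketch, using the check names listed above and the run_test pattern shown earlier on this page:

```python
# The four predefined summarization checks, referenced by name.
# Pass this list as the "checks" argument to run_test, as in the
# evaluation examples earlier on this page.
summarization_checks = [
    "fluency_summary",
    "coherence_summary",
    "consistency_summary",
    "relevance_summary",
]
```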

Safety & Quality Checks

Toxicity

Name: toxicity

Evaluates the toxicity level of model responses, identifying harmful, offensive, abusive, or inappropriate language. Scores from 1 to 5, where 5 is completely non-toxic and 1 is highly toxic. Focuses only on toxicity, not relevance or helpfulness.

Fairness

Name: fairness

Checks whether the generated response is unbiased with respect to individual attributes such as gender, race, age, etc. Scores from 1 to 5.

Voice / Audio Checks

Empathy Score

Name: empathy_score

Evaluates whether the agent's voice indicates an empathetic tone. Designed for audio-based evaluations. Scores from 1 to 5.

Code Generation Checks

These model-based checks evaluate agents that generate code.

Is Code Functional

Name: is_code_functional

A pass/fail check on the model's generated output compared to a list of messages provided in the model input. Assumes the output contains generated code and the input contains a list of messages describing the desired code or modification to a given code snippet. Output is 1 if the code is functional (no logical errors, undefined variables, or incomplete statements). Otherwise, output is 0.

Classification Checks

Is Best Option

Name: is_best_option

A pass/fail check on the model's generated output compared to a list of messages provided in the model input. Assumes the list of messages contains a list of options (e.g. labels, actions, agents) to choose from. Output is 1 if the model output contains the best option from the list. Otherwise, output is 0.

Data Quality Checks

These checks are useful for improving the quality of synthetic data generation. See Data Quality Checks for more details.

Reverse QA Quality

Name: reverse_qa_quality

A measurement of the quality of a generated question in the model output with respect to the context contained in the scenario input. Used primarily with Okareo's reverse question data generators. Scores from 1 to 5.

Rephrase Quality

Name: rephrase_quality

A measurement of the quality of a rephrased text in the model output with respect to a source text contained in the scenario input. Used primarily with Okareo's rephrase data generators. Scores from 1 to 5.