Judge Checks
A Judge Check uses a prompt template to evaluate the generated response. It's particularly useful when you want to leverage an existing language model to perform the evaluation.
Custom Judge Checks
Here's an example of how to create a Judge Check with the Python SDK:
from okareo import Okareo
from okareo.checks import CheckOutputType, ModelBasedCheck

# assumes OKAREO_API_KEY holds your Okareo API key
okareo = Okareo(OKAREO_API_KEY)

check_sample_score = okareo.create_or_update_check(
    name="check_sample_score",
    description="Check sample score",
    check=ModelBasedCheck(
        prompt_template="Only output the number of words in the following text: {input} {result} {generation}",
        check_type=CheckOutputType.SCORE,
    ),
)
In this example:
- name: A unique identifier for the check.
- description: A brief description of what the check does.
- check: An instance of ModelBasedCheck.
- prompt_template: A string that includes placeholders (input, result, generation) which will be replaced with actual values.
- check_type: Specifies the type of output (SCORE or PASS_FAIL).
The prompt_template should include at least one of the following placeholders:
- generation: corresponds to the model's output
- input: corresponds to the scenario input
- result: corresponds to the scenario result
The check_type should be one of:
- CheckOutputType.SCORE: The template should prompt the model for a score (single number)
- CheckOutputType.PASS_FAIL: The template should prompt the model for a boolean value (True/False)
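A pass/fail judge check follows the same pattern. Here's a minimal sketch, assuming the same okareo client and imports as above; the check name and prompt are illustrative:
check_is_polite = okareo.create_or_update_check(
    name="check_is_polite",  # hypothetical check name
    description="Judge whether the response is polite to the user",
    check=ModelBasedCheck(
        prompt_template="Output True if the following response is polite, otherwise output False: {generation}",
        check_type=CheckOutputType.PASS_FAIL,
    ),
)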
Predefined Judge Checks
Okareo provides out-of-the-box judge checks so you can quickly assess your LLM's performance. You can list the available checks in the SDK by running the following method:
- TypeScript
- Python
okareo.get_all_checks()
okareo.get_all_checks()
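For example, in Python you can print what's available. This is a minimal sketch that assumes each returned check exposes name and description fields; inspect the returned objects in your SDK version.
# assumes "okareo" is an instantiated Okareo client
for check in okareo.get_all_checks():
    # assumed fields for illustration
    print(check.name, "-", check.description)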
To use any of these checks, you simply specify them when running an evaluation as follows:
- TypeScript
- Python
const checks = ['check_name_1', 'check_name_2', ..., 'check_name_N'];
// assume that "scenario_id" is the UUID of an existing scenario set
const eval_results: any = await model.run_test({
    model_api_key: OPENAI_API_KEY,
    name: 'Evaluation Name',
    tags: ["Example", `Build:${UNIQUE_BUILD_ID}`],
    project_id: project_id,
    scenario_id: scenario_id,
    calculate_metrics: true,
    type: TestRunType.NL_GENERATION,
    checks: checks,
} as RunTestProps);
checks = ['check_name_1', 'check_name_2', ..., 'check_name_N']
# assume that "scenario" is a ScenarioSetResponse object or a UUID
evaluation = model_under_test.run_test(
    name="Evaluation Name",
    scenario=scenario,
    project_id=project_id,
    test_run_type=TestRunType.NL_GENERATION,
    calculate_metrics=True,
    checks=checks,
)
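Once the run completes, you can jump to the results in the Okareo app. This short sketch assumes the returned Python evaluation object exposes an app_link field:
# assumed field; check the object returned by run_test in your SDK version
print(f"See results in Okareo: {evaluation.app_link}")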
The remainder of this guide details the different categories of out-of-the-box judge checks that are available in Okareo.
Agent Behavioral Checks
The following checks are used to assess different agentic behaviors and instruction-following capabilities.
Behavior Adherence
Name: behavior_adherence
A measurement of how well the model_output follows instructions, directives, or commands outlined in the scenario_result. This check works best when the scenario_result contains a description of expected model behaviors. This check is not concerned with factual accuracy, but rather adherence to model instructions. Output is a pass/fail (1 or 0).
Loop Guard
Name: loop_guard
Evaluates whether conversations between AI agents are getting stuck in repetitive patterns (loops) without making meaningful progress.
Model Refusal
Name: model_refusal
A measurement of whether or not the model_output refuses to respond to the user's question. Output should be pass/fail (1 or 0).
Task Completed
Name: task_completed
A pass/fail check on the model's generated output compared to a list of messages provided in the model input. Output is 1 if model output fulfills the task specified by the latest 'user' message in the message list. Otherwise, output is 0.
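These behavioral checks plug into an evaluation the same way as any other check. Below is a minimal sketch, assuming the model_under_test and scenario from the earlier Python example and an illustrative selection of check names:
behavior_checks = ["behavior_adherence", "model_refusal", "task_completed"]
evaluation = model_under_test.run_test(
    name="Agent Behavior Evaluation",
    scenario=scenario,
    test_run_type=TestRunType.NL_GENERATION,
    calculate_metrics=True,
    checks=behavior_checks,
)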
Function Call Checks
These checks let you assess the quality of the function calls invoked by your LLM/agent.
Function Result Present
Name: function_result_present
A pass/fail check that evaluates tool messages in a conversation for valid outputs. This check examines messages with a 'tool' role and verifies their outputs are not null, none, or error responses.
Function Parameter Accuracy
Name: function_parameter_accuracy
Ensures extracted values from the dialog are correct and match expectations.
Function Call Validator
Name: function_call_validator
Evaluates whether the correct tool is selected and executed in the proper sequence.
Function Call Consistency
Name: function_call_consistency
A pass/fail check on a function call compared to the model input. This check works for a model input containing either a single message or a list of messages. Output is 1 if one or more function calls are present and each function call's name and parameter(s) are consistent with the model input. Otherwise, output is 0.
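To make these checks concrete, here is a hypothetical OpenAI-style message list of the kind the function call checks inspect; the exact message format depends on your model integration.
# hypothetical messages for illustration only
messages = [
    {"role": "user", "content": "What's the weather in Paris?"},
    # the assistant selects a tool; function_call_validator and function_parameter_accuracy
    # look at the chosen function name and its arguments
    {"role": "assistant", "tool_calls": [
        {"function": {"name": "get_weather", "arguments": '{"city": "Paris"}'}}
    ]},
    # function_result_present passes because this "tool" message has a non-null, non-error output
    {"role": "tool", "content": '{"temperature_c": 18, "condition": "cloudy"}'},
]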
RAG Checks
The following checks can be used to assess RAG completions/conversations.
Context Relevance
Name: context_relevance
A binary evaluation of whether the model input has relevant context. Returns True if the context is relevant to the question, False otherwise.
Context Consistency
Name: context_consistency
A binary evaluation of whether the model output is fully consistent with the model input. This check assesses if the generated output aligns with and is supported by the given input, without contradicting or misrepresenting any information provided. It focuses solely on the consistency between input and output, ignoring external facts. Returns True if consistent, False if any inconsistency is found.
Reference Similarity
Name: reference_similarity
A measurement of how similar the model_output is to the scenario_result. The model_output would be the generated answer from the model, and the scenario_result would be the expected answer. Ranges from 1 to 5.
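These RAG checks read from the scenario's input and result. As a rough sketch, a scenario row might pair the question and retrieved context as the input with the expected answer as the result; the ScenarioSetCreate and SeedData models below are from the Python SDK, so verify the field names against your SDK version.
from okareo_api_client.models import ScenarioSetCreate, SeedData

# a minimal sketch of a scenario row for RAG checks:
# the input carries the question plus retrieved context (used by context_relevance),
# and the result carries the expected answer (used by reference_similarity)
rag_scenario = okareo.create_scenario_set(
    ScenarioSetCreate(
        name="RAG QA Scenario",
        seed_data=[
            SeedData(
                input_="Question: What does the warranty cover? Context: <retrieved passage>",
                result="<expected answer grounded in the passage>",
            )
        ],
    )
)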
Summarization Checks
The following checks make use of an LLM judge. The judge is provided with a system prompt describing the check in question. Each check measures the quality of the natural language text on an integer scale from 1 to 5.
Any summarization check that includes _summary in its name makes use of the scenario_input in addition to the model_output.
Conciseness
Name: conciseness
The conciseness check rates the text on how concise the generated output is. If the model's output contains repeated ideas, the score will be lower.
Uniqueness
Name: uniqueness
The uniqueness check rates the text on how unique it is compared to the other outputs in the evaluation. Consequently, this check uses all the rows in the scenario to score each row individually.
Fluency
Names: fluency, fluency_summary
The fluency check is a measure of quality based on grammar, spelling, punctuation, word choice, and sentence structure. This check does not require a scenario input or result.
Coherence
Names: coherence, coherence_summary
The coherence check is a measurement of structure and organization in a model's output. A higher score indicates that the output is well-structured and organized, and a lower score indicates the opposite.
Consistency
Names: consistency, consistency_summary
The consistency check is a measurement of factual accuracy between the model output and the scenario input. This is useful in summarization tasks, where the model's input is a target document and the model's output is a summary of the target document.
Relevance
Names: relevance, relevance_summary
The relevance check is a measure of summarization quality that rewards highly relevant information and penalizes redundancies or irrelevant information.
Code Generation Checks
These checks evaluate an agent that is generating code that compiles/executes.
Is Code Functional
Name: is_code_functional
A pass/fail check on the model's generated output compared to a list of messages provided in the model input. Assumes the output contains generated code and the input contains a list of messages describing the desired code or modification to a given code snippet. Output is 1 if the code is functional, meaning that the code contains no logical errors, undefined variables/functions, or incomplete statements. Otherwise, output is 0.
Classification Checks
These checks allow you to evaluate agents that are attempting to categorize/classify input messages.
Is Best Option
Name: is_best_option
A pass/fail check on the model's generated output compared to a list of messages provided in the model input. Assumes the list of messages contains a list of options (e.g., labels, actions, agents, etc.) to choose from. Output is 1 if the model output contains the best option from the list. Otherwise, output is 0.
Data Quality Checks
These checks are useful for improving the quality of synthetic data generation. See Data Quality Checks for more details.
Reverse QA Quality
Name: reverse_qa_quality
A measurement of the quality of a generated question in the model_output with respect to the context contained in the scenario_input. Used primarily with Okareo's reverse question data generators. Ranges from 1 to 5.
Rephrase Quality
Name: rephrase_quality
A measurement of the quality of a rephrased text in the model_output with respect to a source text contained in the scenario_input. Used primarily with Okareo's rephrase data generators. Ranges from 1 to 5.