Model Checks
A Model Check uses a prompt template to evaluate the generated response. At runtime, a judge LLM reads your prompt -- with template variables replaced by actual evaluation data -- and returns a score or pass/fail result. Model checks are useful when evaluation requires understanding, reasoning, or subjective judgment that cannot be captured in deterministic code.
Okareo provides many predefined model checks out of the box. You can also write your own custom model checks.
Custom Model Checks
To create a custom model check, write a prompt template -- a set of instructions for a judge LLM -- and register it with Okareo.
Here's an example using the Python SDK:
```python
# "okareo" is an authenticated client, e.g. okareo = Okareo(OKAREO_API_KEY)
from okareo_api_client.models import CheckOutputType
from okareo.checks import ModelBasedCheck  # import path may vary with your SDK version

check = okareo.create_or_update_check(
    name="adherence_check",
    description="Check if model output follows the instructions in the context",
    check=ModelBasedCheck(
        prompt_template="""You will be given a Context and a Model Output.
Your task is to rate the Model Output on one metric: Adherence.
Evaluation Criteria:
Adherence (0 or 1) - whether the Model Output follows the instructions in the Context.
Evaluation Steps:
1. Read the Context and identify the instructions.
2. Read the Model Output and check if it follows those instructions.
3. Assign a score: 1 if adherent, 0 if not.
Context:
{scenario_result}
Model Output:
{generation}
Adherence (0 or 1 in double brackets, explanation in double parentheses):""",
        check_type=CheckOutputType.PASS_FAIL,
    ),
)
```
In this example:
- name: A unique identifier for the check.
- description: A brief description of what the check does.
- prompt_template: The prompt template with template variables (e.g. {generation}, {scenario_result}) that the runtime replaces with actual data. This prompt instructs a judge LLM on how to evaluate.
- check_type: The output type — CheckOutputType.PASS_FAIL (0 or 1) or CheckOutputType.SCORE (numeric scale, e.g. 1–5).
Template Variables
Your prompt must include at least one template variable where the runtime injects real data. Without one, the judge has nothing to evaluate.
| Variable | What the runtime injects |
|---|---|
| {generation} or {model_output} | The model output being evaluated. |
| {input} or {scenario_input} | The scenario/source text. |
| {result} or {scenario_result} | The reference/expected output. |
| {model_input} | What was sent to the model (prompt or messages). |
| {message_history} | Full conversation (model_input messages plus the assistant model_output). |
| {tool_calls} | The model's tool/function calls from the result. |
| {tools} | Tool definitions/schema from the input. |
| {simulation_message_history} | Simulation message history from metadata. |
| {audio_messages} | Audio input messages. |
| {audio_output} | Audio model output. |
Use the template variables that fit your check:
- Summarization (compare output to source): use {input} (or {scenario_input}) and {generation}; see the sketch below.
- Adherence (follow instructions/context): use {result} (or {scenario_result}) and {generation}.
- Task completion (full exchange): use {model_input} and {generation}.
- Function calling: use {model_input}, {generation}, {tool_calls}, and {tools}.
Do not invent new template variable names (e.g. {request}, {history}, {final_message}). Only the variables listed above are substituted at runtime; any other {name} will be left as-is in the prompt and may confuse the judge.
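For example, here is a minimal sketch of a custom summarization check that pairs {input} with {generation}. The check name and prompt wording are illustrative, not predefined; it follows the same create_or_update_check pattern shown above and uses the SCORE output type.

```python
summary_check = okareo.create_or_update_check(
    name="summary_coverage",  # illustrative name, not a predefined check
    description="Judge how completely the summary covers the source text",
    check=ModelBasedCheck(
        prompt_template="""You will be given a Source Text and a Summary.
Your task is to rate the Summary on one metric: Coverage.
Evaluation Criteria:
Coverage (1-5) - how completely the Summary captures the key points of the Source Text.
Source Text:
{input}
Summary:
{generation}
Coverage (1 to 5 in double brackets, explanation in double parentheses):""",
        check_type=CheckOutputType.SCORE,
    ),
)
```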
Writing a Good Prompt
A well-structured judge prompt typically includes:
- Task / context — What the judge is given and what to rate (one metric).
- Evaluation criteria — Metric name, scale (e.g. 0/1 or 1–5), and a short definition.
- Evaluation steps — Numbered instructions for the judge to follow.
- Examples — At least one input/output example showing the expected format.
- Input blocks — Labeled sections with template variables (e.g. "Source Text: {scenario_input}").
- Output format — How the judge should respond.
Output format matters. For pass/fail checks, instruct the judge to respond with 0 or 1 in double brackets (e.g. [[1]]). For score checks, use a numeric scale in double brackets (e.g. [[4]]). Optionally ask for an explanation in double parentheses (e.g. ((The summary is coherent.))).
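To make the convention concrete, a pass/fail judge following this format would reply with something like the string below. The regex is only an illustration of how the score and explanation are delimited; it is not Okareo's parsing code.

```python
import re

# Example judge reply following the [[score]] ((explanation)) convention.
reply = "[[1]] ((The output follows every instruction in the context.))"

score_match = re.search(r"\[\[(\d+)\]\]", reply)              # numeric score
explanation_match = re.search(r"\(\((.*?)\)\)", reply, re.S)  # optional explanation

score = int(score_match.group(1)) if score_match else None
explanation = explanation_match.group(1) if explanation_match else None
print(score, explanation)
```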
Predefined Model Checks
Okareo provides the following predefined model checks. You can list all available checks in the SDK:
TypeScript:

```typescript
okareo.get_all_checks();
```

Python:

```python
okareo.get_all_checks()
```
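To inspect the results programmatically, a minimal sketch (this assumes each item returned by get_all_checks() exposes name and description attributes; adjust the attribute access to your SDK version):

```python
# Print every check registered in the project (predefined and custom).
# Assumption: each returned item has `name` and `description` attributes.
for check in okareo.get_all_checks():
    print(check.name, "-", check.description)
```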
To use any of these checks, specify them by name when running an evaluation:
TypeScript:

```typescript
const checks = ['check_name_1', 'check_name_2', /* ... */ 'check_name_N'];

// assume that "scenario_id" is the UUID of a scenario set
const eval_results: any = await model.run_test({
  model_api_key: OPENAI_API_KEY,
  name: 'Evaluation Name',
  tags: ["Example", `Build:${UNIQUE_BUILD_ID}`],
  project_id: project_id,
  scenario_id: scenario_id,
  calculate_metrics: true,
  type: TestRunType.NL_GENERATION,
  checks: checks,
} as RunTestProps);
```

Python:

```python
checks = ['check_name_1', 'check_name_2', ..., 'check_name_N']

# assume that "scenario" is a ScenarioSetResponse object or a UUID
evaluation = model_under_test.run_test(
    name="Evaluation Name",
    scenario=scenario,
    project_id=project_id,
    test_run_type=TestRunType.NL_GENERATION,
    calculate_metrics=True,
    checks=checks,
)
```
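Because checks are referenced by name, predefined checks and custom checks you have registered can appear in the same list. For example, combining two predefined checks with the adherence_check created earlier:

```python
checks = [
    "behavior_adherence",  # predefined (see below)
    "model_refusal",       # predefined (see below)
    "adherence_check",     # custom check registered with create_or_update_check
]
```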
Agent Behavioral Checks
The following checks assess agentic behaviors and instruction-following capabilities.
Behavior Adherence
Name: behavior_adherence
A measurement of how well the model output follows instructions, directives, or commands outlined in the scenario result. This check works best when the scenario result contains a description of expected model behaviors. Focuses on adherence to instructions, not factual accuracy. Output: pass/fail (0 or 1).
Task Completed
Name: task_completed
A pass/fail check on the model's generated output compared to a list of messages provided in the model input. Output is 1 if the model output fulfills the task specified by the latest "user" message in the message list. Otherwise, output is 0.
Result Completed
Name: result_completed
A pass/fail check that compares the conversation message history in the model input/model output against the expected outcome described in the scenario result. Output is 1 if the model output fulfills the outcome specified by the scenario result. Otherwise, output is 0.
Model Refusal
Name: model_refusal
A measurement of whether the model output refuses to respond to the user's question. Output: pass/fail (0 or 1).
Loop Guard
Name: loop_guard
Evaluates whether conversations between AI agents are getting stuck in repetitive patterns (loops) without making meaningful progress. Output: pass/fail (0 or 1).
Automated Resolution
Name: automated_resolution
Evaluates whether the agent escalated the conversation to a third party or resolved it autonomously. Output: pass/fail (0 or 1).
Response Consistency
Name: response_consistency
Checks each assistant message against the entire live dialog state for consistency. Detects contradictions or information drift across turns. Output: pass/fail (0 or 1).
Response Loop
Name: response_loop
Scans each assistant turn against the dialog to detect repeated content with no new information or action. Output: pass/fail (0 or 1).
Simulation Trace Consistency
Name: simulation_trace_consistency
Compares trace message history captured via OpenTelemetry with the originating simulation dialog to ensure they are consistent. Output: pass/fail (0 or 1).
Function Call Checks
These model-based checks use a judge LLM to evaluate function call quality. For deterministic code-based function call checks (AST validation, parameter matching), see Code Checks -- Function Call Checks.
Function Call Consistency
Name: function_call_consistency
A pass/fail check on a function call compared to the model input. Works for a model input containing either a single message or a list of messages. Output is 1 if one or more function calls are present and each function call's name and parameters are consistent with the model input. Otherwise, output is 0.
Function Call Validator
Name: function_call_validator
Evaluates whether the correct tool is selected and executed in the proper sequence. Output: pass/fail (0 or 1).
Function Parameter Accuracy
Name: function_parameter_accuracy
Ensures extracted values from the dialog are correct and match expectations. Output: pass/fail (0 or 1).
Function Result Present
Name: function_result_present
A pass/fail check that evaluates tool messages in a conversation for valid outputs. Examines messages with a "tool" role and verifies their outputs are not null, none, or error responses. Output: pass/fail (0 or 1).
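If the predefined function call checks do not cover your case, you can write a custom judge using the {model_input}, {tools}, and {tool_calls} variables from the table above. A minimal sketch, following the same create_or_update_check pattern (the check name and prompt wording are illustrative):

```python
tool_check = okareo.create_or_update_check(
    name="tool_choice_reasonable",  # illustrative name, not a predefined check
    description="Judge whether the chosen tool and its parameters fit the user's request",
    check=ModelBasedCheck(
        prompt_template="""You will be given the Model Input, the available Tools, and the Tool Calls made by the model.
Your task is to rate the Tool Calls on one metric: Appropriateness.
Evaluation Criteria:
Appropriateness (0 or 1) - whether the selected tool and its parameters fit the user's request.
Model Input:
{model_input}
Tools:
{tools}
Tool Calls:
{tool_calls}
Appropriateness (0 or 1 in double brackets, explanation in double parentheses):""",
        check_type=CheckOutputType.PASS_FAIL,
    ),
)
```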
RAG Checks
The following checks assess RAG (Retrieval-Augmented Generation) completions and conversations.
Context Relevance
Name: context_relevance
A measurement of the degree to which the model input contains context relevant to the request. Scores from 1 to 5.
Context Consistency
Name: context_consistency
A binary evaluation of whether the model output is fully consistent with the model input. Assesses if the generated output aligns with and is supported by the given input, without contradicting or misrepresenting any information provided. Focuses solely on consistency between input and output, ignoring external facts. Output: pass/fail (0 or 1).
Reference Similarity
Name: reference_similarity
A measurement of how similar the model output is to the scenario result. The model output is the generated answer, and the scenario result is the expected answer. Scores from 1 to 5.
Faithfulness
Name: faithfulness
Checks whether the generated response (model output) is factually consistent with the provided context (scenario result). Scores from 1 to 5.
Summarization Checks
The following model checks use a judge LLM to assess the quality of generated summaries. Each produces an integer score ranging from 1 to 5.
Fluency
Name: fluency_summary
A measurement of quality based on grammar, spelling, punctuation, word choice, and sentence structure. Scores from 1 to 5.
Coherence
Name: coherence_summary
A measurement of structure and organization in the model's output. A higher score indicates a well-structured and well-organized summary. Scores from 1 to 5.
Consistency
Name: consistency_summary
A measurement of factual accuracy between the model output and the model input. Useful in summarization tasks where the model's input is a target document and the output is a summary. Scores from 1 to 5.
Relevance
Name: relevance_summary
A measurement of how much important information from the scenario input is contained in the model output. Rewards highly relevant information and penalizes redundancies or irrelevant content. Scores from 1 to 5.
Safety & Quality Checks
Toxicity
Name: toxicity
Evaluates the toxicity level of model responses, identifying harmful, offensive, abusive, or inappropriate language. Scores from 1 to 5, where 5 is completely non-toxic and 1 is highly toxic. Focuses only on toxicity, not relevance or helpfulness.
Fairness
Name: fairness
Checks whether the generated response is unbiased with respect to individual attributes such as gender, race, age, etc. Scores from 1 to 5.
Voice / Audio Checks
Empathy Score
Name: empathy_score
Evaluates whether the agent's voice indicates an empathetic tone. Designed for audio-based evaluations. Scores from 1 to 5.
Code Generation Checks
These model-based checks evaluate agents that generate code.
Is Code Functional
Name: is_code_functional
A pass/fail check on the model's generated output compared to a list of messages provided in the model input. Assumes the output contains generated code and the input contains a list of messages describing the desired code or modification to a given code snippet. Output is 1 if the code is functional (no logical errors, undefined variables, or incomplete statements). Otherwise, output is 0.
Classification Checks
Is Best Option
Name: is_best_option
A pass/fail check on the model's generated output compared to a list of messages provided in the model input. Assumes the list of messages contains a list of options (e.g. labels, actions, agents) to choose from. Output is 1 if the model output contains the best option from the list. Otherwise, output is 0.
Data Quality Checks
These checks are useful for improving the quality of synthetic data generation. See Data Quality Checks for more details.
Reverse QA Quality
Name: reverse_qa_quality
A measurement of the quality of a generated question in the model output with respect to the context contained in the scenario input. Used primarily with Okareo's reverse question data generators. Scores from 1 to 5.
Rephrase Quality
Name: rephrase_quality
A measurement of the quality of a rephrased text in the model output with respect to a source text contained in the scenario input. Used primarily with Okareo's rephrase data generators. Scores from 1 to 5.