
Model Checks

A Model Check uses a prompt template to evaluate the generated response. At runtime, a judge LLM reads your prompt -- with template variables replaced by actual evaluation data -- and returns a score or pass/fail result. Model checks are useful when evaluation requires understanding, reasoning, or subjective judgment that cannot be captured in deterministic code.

Okareo provides many predefined model checks out of the box. You can also write your own custom model checks.


Custom Model Checks

To create a custom model check, write a prompt template -- a set of instructions for a judge LLM -- and register it with Okareo.

Here's an example using the Python SDK:

```python
from okareo import Okareo
from okareo.checks import ModelBasedCheck  # check classes live in the SDK's checks module
from okareo_api_client.models import CheckOutputType

okareo = Okareo("<OKAREO_API_KEY>")

check = okareo.create_or_update_check(
    name="adherence_check",
    description="Check if model output follows the instructions in the context",
    check=ModelBasedCheck(
        prompt_template="""You will be given a Context and a Model Output.
Your task is to rate the Model Output on one metric: Adherence.

Evaluation Criteria:
Adherence (0 or 1) - whether the Model Output follows the instructions in the Context.

Evaluation Steps:
1. Read the Context and identify the instructions.
2. Read the Model Output and check if it follows those instructions.
3. Assign a score: 1 if adherent, 0 if not.

Context:
{scenario_result}

Model Output:
{generation}

Adherence (0 or 1 in double brackets, explanation in double parentheses):""",
        check_type=CheckOutputType.PASS_FAIL,
    ),
)
```

In this example:

  • name: A unique identifier for the check.
  • description: A brief description of what the check does.
  • prompt_template: The prompt template with template variables (e.g. {generation}, {scenario_result}) that the runtime replaces with actual data. This prompt instructs a judge LLM on how to evaluate.
  • check_type: The output type — CheckOutputType.PASS_FAIL (0 or 1) or CheckOutputType.SCORE (numeric scale, e.g. 1–5).
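
Once registered, the check can be referenced by name in an evaluation. The following is a minimal sketch that assumes a registered model (model_under_test) and a scenario already exist; it follows the run_test pattern shown later on this page, and the evaluation name is illustrative.

```python
from okareo_api_client.models.test_run_type import TestRunType

# Run an evaluation that applies the custom check by name.
# Assumes "model_under_test" and "scenario" were created elsewhere.
evaluation = model_under_test.run_test(
    name="Adherence Evaluation",
    scenario=scenario,
    test_run_type=TestRunType.NL_GENERATION,
    calculate_metrics=True,
    checks=["adherence_check"],  # the name passed to create_or_update_check
)
```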

Template Variables

Your prompt must include at least one template variable where the runtime injects real data. Without one, the judge has nothing to evaluate.

| Variable | What the runtime injects |
| --- | --- |
| {generation} or {model_output} | The model output being evaluated. |
| {input} or {scenario_input} | The scenario/source text. |
| {result} or {scenario_result} | The reference/expected output. |
| {model_input} | What was sent to the model (prompt or messages). |
| {message_history} | Full conversation (model_input messages + assistant model_output). |
| {tool_calls} | Model's tool/function calls from the result. |
| {tools} | Tool definitions/schema from input. |
| {simulation_message_history} | Simulation message history from metadata. |
| {audio_messages} | Audio input messages. |
| {audio_output} | Audio model output. |

Use the template variables that fit your check:

  • Summarization (compare output to source): use {input} (or {scenario_input}) and {generation}.
  • Adherence (follow instructions/context): use {result} (or {scenario_result}) and {generation}.
  • Task completion (full exchange): use {model_input} and {generation}.
  • Function calling: use {model_input}, {generation}, {tool_calls}, and {tools}.
Warning: Do not invent new template variable names (e.g. {request}, {history}, {final_message}). Only the variables listed above are substituted at runtime; any other {name} will be left as-is in the prompt and may confuse the judge.

Writing a Good Prompt

A well-structured judge prompt typically includes:

  1. Task / context — What the judge is given and what to rate (one metric).
  2. Evaluation criteria — Metric name, scale (e.g. 0/1 or 1–5), and a short definition.
  3. Evaluation steps — Numbered instructions for the judge to follow.
  4. Examples — At least one input/output example showing the expected format.
  5. Input blocks — Labeled sections with template variables (e.g. "Source Text: {scenario_input}").
  6. Output format — How the judge should respond.

Output format matters. For pass/fail checks, instruct the judge to respond with 0 or 1 in double brackets (e.g. [[1]]). For score checks, use a numeric scale in double brackets (e.g. [[4]]). Optionally ask for an explanation in double parentheses (e.g. ((The summary is coherent.))).
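
Putting these pieces together, here is an illustrative sketch of a score-based check that follows the structure above. It is not one of Okareo's predefined checks; the metric (Clarity), its wording, the check name, and the import paths are assumptions to adapt to your own task.

```python
from okareo import Okareo
from okareo.checks import ModelBasedCheck
from okareo_api_client.models import CheckOutputType

okareo = Okareo("<OKAREO_API_KEY>")

# Illustrative judge prompt for a 1-5 score check. Template variables:
# {scenario_input} is the source text, {generation} is the model output.
clarity_prompt = """You will be given a Source Text and a Model Output.
Your task is to rate the Model Output on one metric: Clarity.

Evaluation Criteria:
Clarity (1-5) - how clearly the Model Output communicates the key points of the Source Text. 1 is very unclear, 5 is very clear.

Evaluation Steps:
1. Read the Source Text and identify its key points.
2. Read the Model Output and judge how clearly those points are communicated.
3. Assign a score from 1 to 5.

Example:
Source Text: <a short article>
Model Output: <a clear restatement of its key points>
Clarity (1-5 in double brackets, explanation in double parentheses):
[[5]] ((The output states every key point in plain language.))

Source Text:
{scenario_input}

Model Output:
{generation}

Clarity (1-5 in double brackets, explanation in double parentheses):"""

check = okareo.create_or_update_check(
    name="clarity_check",  # assumed name for this illustration
    description="Rate how clearly the output communicates the source's key points",
    check=ModelBasedCheck(
        prompt_template=clarity_prompt,
        check_type=CheckOutputType.SCORE,
    ),
)
```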


Predefined Model Checks

Okareo provides the following predefined model checks. You can list all available checks in the SDK:

```javascript
okareo.get_all_checks()
```
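
In the Python SDK, the equivalent call is shown in the sketch below; the exact shape of the returned check objects can vary by SDK version, so the name attribute access is an assumption.

```python
# List every check (predefined and custom) available to your project.
all_checks = okareo.get_all_checks()
for check in all_checks:
    # Assumption: each returned entry exposes a name attribute; inspect the
    # objects returned by your SDK version if this differs.
    print(check.name)
```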

To use any of these checks, specify them by name when running an evaluation:

TypeScript:

```typescript
const checks = ['check_name_1', 'check_name_2', ..., 'check_name_N'];

// assume that "scenario" is a ScenarioSetResponse object or a UUID
const eval_results: any = await model.run_test({
    model_api_key: OPENAI_API_KEY,
    name: 'Evaluation Name',
    tags: ["Example", `Build:${UNIQUE_BUILD_ID}`],
    project_id: project_id,
    scenario_id: scenario_id,
    calculate_metrics: true,
    type: TestRunType.NL_GENERATION,
    checks: checks,
} as RunTestProps);
```

Python:

```python
checks = ['check_name_1', 'check_name_2', ..., 'check_name_N']

# assume that "scenario" is a ScenarioSetResponse object or a UUID
evaluation = model_under_test.run_test(
    name="Evaluation Name",
    scenario=scenario,
    project_id=project_id,
    test_run_type=TestRunType.NL_GENERATION,
    calculate_metrics=True,
    checks=checks,
)
```

Agent Behavioral Checks

The following checks assess agentic behaviors and instruction-following capabilities.

Behavior Adherence

Name: behavior_adherence

A measurement of how well the model output follows instructions, directives, or commands outlined in the scenario result. This check works best when the scenario result contains a description of expected model behaviors. Focuses on adherence to instructions, not factual accuracy. Output: pass/fail (0 or 1).

Task Completed

Name: task_completed

A pass/fail check on the model's generated output compared to a list of messages provided in the model input. Output is 1 if the model output fulfills the task specified by the latest "user" message in the message list. Otherwise, output is 0.

Result Completed

Name: result_completed

A pass/fail check that compares the conversation message history in the model input/model output against the expected outcome described in the scenario result. Output is 1 if the model output fulfills the outcome specified by the scenario result. Otherwise, output is 0.

Model Refusal

Name: model_refusal

A measurement of whether the model output refuses to respond to the user's question. Output: pass/fail (0 or 1).

Loop Guard

Name: loop_guard

Evaluates whether conversations between AI agents are getting stuck in repetitive patterns (loops) without making meaningful progress. Output: pass/fail (0 or 1).

Automated Resolution

Name: automated_resolution

Evaluates whether the agent escalated the conversation to a third party or resolved it autonomously. Output: pass/fail (0 or 1).

Response Consistency

Name: response_consistency

Checks each assistant message against the entire live dialog state for consistency. Detects contradictions or information drift across turns. Output: pass/fail (0 or 1).

Response Loop

Name: response_loop

Scans each assistant turn against the dialog to detect repeated content with no new information or action. Output: pass/fail (0 or 1).

Simulation Trace Consistency

Name: simulation_trace_consistency

Compares trace message history captured via OpenTelemetry with the originating simulation dialog to ensure they are consistent. Output: pass/fail (0 or 1).

Function Call Checks

These model-based checks use a judge LLM to evaluate function call quality. For deterministic code-based function call checks (AST validation, parameter matching), see Code Checks -- Function Call Checks.

Function Call Consistency

Name: function_call_consistency

A pass/fail check on a function call compared to the model input. Works for a model input containing either a single message or a list of messages. Output is 1 if one or more function calls are present and each function call's name and parameters are consistent with the model input. Otherwise, output is 0.

Function Call Validator

Name: function_call_validator

Evaluates whether the correct tool is selected and executed in the proper sequence. Output: pass/fail (0 or 1).

Function Parameter Accuracy

Name: function_parameter_accuracy

Ensures extracted values from the dialog are correct and match expectations. Output: pass/fail (0 or 1).

Function Result Present

Name: function_result_present

A pass/fail check that evaluates tool messages in a conversation for valid outputs. Examines messages with a "tool" role and verifies their outputs are not null, none, or error responses. Output: pass/fail (0 or 1).

RAG Checks

The following checks assess RAG (Retrieval-Augmented Generation) completions and conversations.

Context Relevance

Name: context_relevance

A measurement of how relevant the context contained in the model input is. Scores from 1 to 5.

Context Consistency

Name: context_consistency

A binary evaluation of whether the model output is fully consistent with the model input. Assesses if the generated output aligns with and is supported by the given input, without contradicting or misrepresenting any information provided. Focuses solely on consistency between input and output, ignoring external facts. Output: pass/fail (0 or 1).

Reference Similarity

Name: reference_similarity

A measurement of how similar the model output is to the scenario result. The model output is the generated answer, and the scenario result is the expected answer. Scores from 1 to 5.

Faithfulness

Name: faithfulness

Checks whether the generated response (model output) is factually consistent with the provided context (scenario result). Scores from 1 to 5.

Summarization Checks

The following model checks use a judge LLM to assess the quality of generated summaries. Each produces an integer score ranging from 1 to 5.

Fluency

Name: fluency_summary

A measurement of quality based on grammar, spelling, punctuation, word choice, and sentence structure. Scores from 1 to 5.

Coherence

Name: coherence_summary

A measurement of structure and organization in the model's output. A higher score indicates a well-structured and well-organized summary. Scores from 1 to 5.

Consistency

Name: consistency_summary

A measurement of factual accuracy between the model output and the model input. Useful in summarization tasks where the model's input is a target document and the output is a summary. Scores from 1 to 5.

Relevance

Name: relevance_summary

A measurement of how much important information from the scenario input is contained in the model output. Rewards highly relevant information and penalizes redundancies or irrelevant content. Scores from 1 to 5.
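
In practice, these four checks are often run together on the same evaluation. A minimal sketch, using the check names listed above and the run_test pattern shown earlier on this page:

```python
# The four predefined summarization checks, referenced by name.
# Pass this list as the "checks" argument to run_test, as in the
# evaluation examples earlier on this page.
summarization_checks = [
    "fluency_summary",
    "coherence_summary",
    "consistency_summary",
    "relevance_summary",
]
```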

Safety & Quality Checks

Toxicity

Name: toxicity

Evaluates the toxicity level of model responses, identifying harmful, offensive, abusive, or inappropriate language. Scores from 1 to 5, where 5 is completely non-toxic and 1 is highly toxic. Focuses only on toxicity, not relevance or helpfulness.

Fairness

Name: fairness

Checks whether the generated response is unbiased with respect to individual attributes such as gender, race, age, etc. Scores from 1 to 5.

Voice / Audio Checks

Empathy Score

Name: empathy_score

Evaluates whether the agent's voice indicates an empathetic tone. Designed for audio-based evaluations. Scores from 1 to 5.

Code Generation Checks

These model-based checks evaluate agents that generate code.

Is Code Functional

Name: is_code_functional

A pass/fail check on the model's generated output compared to a list of messages provided in the model input. Assumes the output contains generated code and the input contains a list of messages describing the desired code or modification to a given code snippet. Output is 1 if the code is functional (no logical errors, undefined variables, or incomplete statements). Otherwise, output is 0.

Classification Checks

Is Best Option

Name: is_best_option

A pass/fail check on the model's generated output compared to a list of messages provided in the model input. Assumes the list of messages contains a list of options (e.g. labels, actions, agents) to choose from. Output is 1 if the model output contains the best option from the list. Otherwise, output is 0.

Data Quality Checks

These checks are useful for improving the quality of synthetic data generation. See Data Quality Checks for more details.

Reverse QA Quality

Name: reverse_qa_quality

A measurement of the quality of a generated question in the model output with respect to the context contained in the scenario input. Used primarily with Okareo's reverse question data generators. Scores from 1 to 5.

Rephrase Quality

Name: rephrase_quality

A measurement of the quality of a rephrased text in the model output with respect to a source text contained in the scenario input. Used primarily with Okareo's rephrase data generators. Scores from 1 to 5.