
Code Checks

A Code Check uses Python code to evaluate the generated response. Your code runs on each evaluation row and returns a result — a pass/fail boolean, a numeric score, or a CheckResponse with an explanation. This is useful when you need deterministic logic, string comparisons, metadata inspection, or domain-specific evaluation that doesn't require a model check.

Okareo provides many predefined code checks out of the box. You can also write your own custom code checks.


Custom Code Checks

To create a custom Code Check:

  1. Create a new Python file (not in a notebook).
  2. Define a class named Check that inherits from CodeBasedCheck.
  3. Implement a static evaluate method.
  4. Include any additional code used by your check in the same file.

Here's an example:

# In my_custom_check.py
from okareo.checks import CodeBasedCheck, CheckResponse

class Check(CodeBasedCheck):
    @staticmethod
    def evaluate(model_output: str, scenario_result: str) -> CheckResponse:
        passed = model_output.strip().lower() == scenario_result.strip().lower()
        return CheckResponse(
            score=passed,
            explanation="Exact match." if passed else "Output did not match reference.",
        )

Then, create or update the check using:

check_sample_code = okareo.create_or_update_check(
    name="check_sample_code",
    description="Check if output matches the expected result exactly",
    check=Check(),
)

For the full SDK workflow (generating, uploading, and running custom checks), see Custom Checks.

Allowed Parameters

Your evaluate method may use any subset of the parameters below. The runtime passes values by name, so parameter order does not matter. Use only the parameters your check needs.

  • model_output (str): The model output being evaluated.
  • scenario_input (str): The scenario/source text (when the run has scenario data).
  • scenario_result (str): The reference/expected output (when the run has reference data).
  • metadata (dict): Per-row metadata; see The metadata dict below.
  • model_input (varies): What was sent to the model (e.g. prompt or messages).
  • audio_messages (varies): Audio input; may be None if not applicable.
  • audio_output (varies): Audio output; may be None if not applicable.

Any other parameter name will cause a validation or runtime failure.
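
For example, a check that only needs the generated output and the scenario input can declare just those two names. A minimal sketch (the length rule is an arbitrary illustration):

from okareo.checks import CodeBasedCheck, CheckResponse

class Check(CodeBasedCheck):
    @staticmethod
    def evaluate(model_output: str, scenario_input: str) -> CheckResponse:
        # Hypothetical rule: a summary should be non-empty and shorter than its source text.
        passed = 0 < len(model_output) < len(scenario_input)
        return CheckResponse(
            score=passed,
            explanation=f"Output is {len(model_output)} chars; input is {len(scenario_input)} chars.",
        )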

Return Types

Your evaluate method must return one of the following shapes:

  • Single value: bool, int, or float. Score only, no explanation. Example: return len(model_output) > 0
  • CheckResponse: named type with score, explanation, and optional metadata. Recommended. Example: return CheckResponse(score=True, explanation="Non-empty output.")
  • 2-tuple: (explanation, score), in that exact order. Example: return ("Output is non-empty.", True)
  • 3-tuple: (explanation, score, metadata), in that exact order. Example: return ("OK", 0.95, {"detail": "high"})

We recommend returning CheckResponse(score=..., explanation=...) so the UI can display why a check passed or failed.

warning

If you return a tuple, the order must be (explanation, score) — not (score, explanation). Returning them in the wrong order will cause the values to be misinterpreted at runtime.
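
For instance, a minimal tuple-returning check (the word-count rule is arbitrary) keeps the explanation in the first position:

from okareo.checks import CodeBasedCheck

class Check(CodeBasedCheck):
    @staticmethod
    def evaluate(model_output: str) -> tuple:
        word_count = len(model_output.split())
        # Tuple order is (explanation, score): the explanation comes first.
        return (f"Output contains {word_count} words.", word_count > 0)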

Allowed Imports

You may only import from these modules. Imports from any other module will fail validation.

  • Standard / data: string, re, json, ast, jsonschema, difflib, math, statistics, collections, itertools, functools, typing, enum, datetime
  • NLP / metrics: Levenshtein, rapidfuzz, jiwer, nltk, sacrebleu, rouge_score, spacy, sklearn, numpy
  • Okareo: okareo (e.g. from okareo.checks import CodeBasedCheck, CheckResponse)

Dangerous constructs like exec, eval, and open are also disallowed.
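
As an example, a custom check can lean on one of the allow-listed libraries. The sketch below uses rapidfuzz; the 0.8 threshold is an arbitrary choice for illustration:

from rapidfuzz import fuzz
from okareo.checks import CodeBasedCheck, CheckResponse

class Check(CodeBasedCheck):
    @staticmethod
    def evaluate(model_output: str, scenario_result: str) -> CheckResponse:
        # rapidfuzz returns a similarity score in [0, 100]; rescale to [0, 1].
        similarity = fuzz.ratio(model_output, scenario_result) / 100.0
        return CheckResponse(
            score=similarity >= 0.8,
            explanation=f"Similarity {similarity:.2f} vs. threshold 0.8.",
        )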

The metadata Dict

When you include a metadata parameter, it is a dict that may contain the following keys (depending on the run type):

  • latency: Response latency (ms).
  • input_tokens / output_tokens: Token counts for the request.
  • cost: Cost for the request.
  • tool_calls: The model's tool/function calls (list/dict).
  • tools: Tool definitions/schema from the input.
  • turn_taking_latency: Voice/turn-taking latency when applicable.
  • words_per_minute: Voice words-per-minute when applicable.
  • simulation_message_history: Message history when simulation/trace context exists.
  • user_<key>: User-defined keys from datapoint metadata.

Not all keys are present in every run. Use metadata.get(key, default) when reading.
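
For instance, a check that fails slow or expensive rows might read the dict defensively; the limits below are hypothetical:

from okareo.checks import CodeBasedCheck, CheckResponse

class Check(CodeBasedCheck):
    @staticmethod
    def evaluate(model_output: str, metadata: dict) -> CheckResponse:
        # Not every run records these keys, so always supply a default.
        latency = metadata.get("latency", 0)
        cost = metadata.get("cost", 0.0)
        passed = latency < 3000 and cost < 0.01
        return CheckResponse(
            score=passed,
            explanation=f"latency={latency} ms, cost={cost}",
        )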


Predefined Code Checks

Okareo provides the following predefined code-based checks. You can list all available checks in the SDK:

okareo.get_all_checks()

To use any of these checks, specify them by name when running an evaluation:

TypeScript:

const checks = ['check_name_1', 'check_name_2', ..., 'check_name_N']

// assume that "scenario" is a ScenarioSetResponse object or a UUID
const eval_results: any = await model.run_test({
    model_api_key: OPENAI_API_KEY,
    name: 'Evaluation Name',
    tags: ["Example", `Build:${UNIQUE_BUILD_ID}`],
    project_id: project_id,
    scenario_id: scenario_id,
    calculate_metrics: true,
    type: TestRunType.NL_GENERATION,
    checks: checks,
} as RunTestProps);

Python:

checks = ['check_name_1', 'check_name_2', ..., 'check_name_N',]

# assume that "scenario" is a ScenarioSetResponse object or a UUID
evaluation = model_under_test.run_test(
    name="Evaluation Name",
    scenario=scenario,
    project_id=project_id,
    test_run_type=TestRunType.NL_GENERATION,
    calculate_metrics=True,
    checks=checks,
)

Reference Checks

Is JSON

Name: is_json

Checks whether the model output is valid JSON. Returns True if the output parses as JSON, False otherwise.
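
The logic is roughly equivalent to the sketch below (the shipped check may differ in detail):

import json

def looks_like_json(model_output: str) -> bool:
    # True only if the whole output parses as JSON.
    try:
        json.loads(model_output)
        return True
    except json.JSONDecodeError:
        return False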

Exact Match

Name: exact_match

Checks for an exact string or dict match between the model_output and the scenario_result.

Fuzzy Match

Name: fuzzy_match

Checks for a fuzzy string match between the model_output and the scenario_result. Uses Python's difflib.SequenceMatcher(...).real_quick_ratio() method (difflib documentation). If the ratio exceeds a threshold, the sequences are considered a fuzzy match.
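
Conceptually, the comparison looks like the following sketch; the threshold here is illustrative, not the value the predefined check uses:

import difflib

def is_fuzzy_match(model_output: str, scenario_result: str, threshold: float = 0.6) -> bool:
    # real_quick_ratio() gives a fast upper bound on the similarity ratio.
    ratio = difflib.SequenceMatcher(None, model_output, scenario_result).real_quick_ratio()
    return ratio >= threshold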

Natural Language Checks

Compression Ratio

Name: compression_ratio

Measures how much smaller (or larger) the generated text is compared to the scenario input. Returns len(model_output) / len(scenario_input).

Levenshtein Distance

Names: levenshtein_distance, levenshtein_distance_input

Measures the number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another.

  • levenshtein_distance compares model_output against scenario_result.
  • levenshtein_distance_input compares model_output against scenario_input.
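
A custom variant of the same idea, using the allow-listed Levenshtein package, could look like this (returning the raw edit distance as the score is an illustrative choice):

import Levenshtein
from okareo.checks import CodeBasedCheck, CheckResponse

class Check(CodeBasedCheck):
    @staticmethod
    def evaluate(model_output: str, scenario_result: str) -> CheckResponse:
        distance = Levenshtein.distance(model_output, scenario_result)
        return CheckResponse(
            score=distance,
            explanation=f"{distance} single-character edits between output and reference.",
        )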

Corpus BLEU

Name: corpus_BLEU

Computes the corpus BLEU score using NLTK's corpus_bleu method, comparing the sentences in model_output (candidate) against the sentences in scenario_result (reference).
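
As a rough sketch of the computation (the whitespace tokenization here is illustrative, not necessarily what the predefined check uses):

from nltk.translate.bleu_score import corpus_bleu

def bleu(model_output: str, scenario_result: str) -> float:
    # corpus_bleu expects a list of reference lists and a list of hypotheses,
    # each given as token lists; here one candidate against one reference.
    references = [[scenario_result.split()]]
    hypotheses = [model_output.split()]
    return corpus_bleu(references, hypotheses)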

Performance & Metadata Checks

These checks read from the metadata dict to report performance metrics. They are useful for monitoring latency, token usage, and cost.

Latency

Name: latency

Measures the response latency (in milliseconds) of the model.

Average Turn Latency

Name: avg_turn_latency

Measures the average response time (in milliseconds) of the target model across turns. Applicable to multi-turn evaluations.

Average Turn-Taking Latency

Name: avg_turn_taking_latency

Measures the average response time (in milliseconds) of a voice target when processing a TTS request.

Average Words Per Minute

Name: avg_words_per_minute

Measures the average words per minute (WPM) of the audio generated by a voice target.

Input Tokens

Name: input_tokens

Measures the number of input tokens used in a single model request.

Output Tokens

Name: output_tokens

Measures the number of output tokens used in a single model response.

Cost

Name: cost

Measures the cost of a single model request.

Total Input Tokens

Name: total_input_tokens

Measures the total number of input tokens for the target model in the scenario. When used in multi-turn, it is the total across all turns up to the current turn.

Total Output Tokens

Name: total_output_tokens

Measures the total number of output tokens for the target model in the scenario. When used in multi-turn, it is the total across all turns up to the current turn.

Total Cost

Name: total_cost

Measures the total cost for the target model in the scenario. When used in multi-turn, it is the total across all turns up to the current turn.

Total Turn Count

Name: total_turn_count

Measures the total number of turns as user-assistant pairs, ignoring system messages and assistant tool-only stubs.

Function Call Checks (Code-Based)

The following code-based checks validate the structure and content of LLM/agent function calls. For model-based function call checks that use a judge LLM, see the Function Call Checks section of Model Checks.

Function Call AST Validator

Name: function_call_ast_validator

Validates function calls using the simple AST checker from the Berkeley Function Call Leaderboard repo. The tool call in the model output is compared against the expected structure defined in the scenario result.

Function Call Conversation AST Validator

Name: function_call_conversation_ast_validator

Validates all function calls across a multi-turn conversation using the AST checker. Extends function_call_ast_validator to handle conversations with multiple function calls.

Function Call Reference Validator

Name: function_call_reference_validator

Validates function calls by comparing the structure and content of tool calls in the model output against the expected structure defined in the scenario result. Supports nested structures and regex matching for string values.

Is Function Correct

Name: is_function_correct

Checks if the generated function call name(s) in the tool call match the expected function call name(s) in the scenario result.
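
A simplified custom variant of this idea might look like the sketch below. It assumes OpenAI-style tool-call dicts in metadata["tool_calls"] and a scenario result that is a JSON list of expected function names; both are illustrative assumptions rather than the formats the predefined check handles.

import json
from okareo.checks import CodeBasedCheck, CheckResponse

class Check(CodeBasedCheck):
    @staticmethod
    def evaluate(scenario_result: str, metadata: dict) -> CheckResponse:
        # Assumes OpenAI-style tool calls: [{"function": {"name": ...}}, ...]
        # and a scenario result that is a JSON list of expected names.
        tool_calls = metadata.get("tool_calls") or []
        generated = {call.get("function", {}).get("name") for call in tool_calls}
        expected = set(json.loads(scenario_result))
        passed = generated == expected
        return CheckResponse(
            score=passed,
            explanation=f"Generated {sorted(generated)}; expected {sorted(expected)}.",
        )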

Are Required Params Present

Name: are_required_params_present

Checks if the generated arguments in the function call contain the required arguments specified in the scenario result.

Are All Params Expected

Name: are_all_params_expected

Checks if the generated argument names in the function call are expected based on the schema in the scenario result. Ensures that the generated arguments are not hallucinated.

Do Param Values Match

Name: do_param_values_match

Checks if each generated argument value in the function call matches the corresponding argument value in the scenario result.

Code Generation Checks

Does Code Compile

Name: does_code_compile

Checks whether the generated Python code compiles. Useful for verifying that the model output contains valid Python rather than natural language, HTML, or other non-Python content.
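
A rough equivalent of this idea using the allow-listed ast module (the predefined check's exact implementation may differ):

import ast

def compiles_as_python(model_output: str) -> bool:
    # Parsing with ast catches syntax errors without executing the code.
    try:
        ast.parse(model_output)
        return True
    except SyntaxError:
        return False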

Code Contains All Imports

Name: contains_all_imports

Checks that all the object/function calls in the generated code have the corresponding import statements included.