Code Checks
A Code Check uses Python code to evaluate the generated response. Your code runs on each evaluation row and returns a result — a pass/fail boolean, a numeric score, or a CheckResponse with an explanation. This is useful when you need deterministic logic, string comparisons, metadata inspection, or domain-specific evaluation that doesn't require a model check.
Okareo provides many predefined code checks out of the box. You can also write your own custom code checks.
Custom Code Checks
To create a custom Code Check:
- Create a new Python file (not in a notebook).
- Define a class named `Check` that inherits from `CodeBasedCheck`.
- Implement a static `evaluate` method.
- Include any additional code used by your check in the same file.
Here's an example:
```python
# In my_custom_check.py
from okareo.checks import CodeBasedCheck, CheckResponse


class Check(CodeBasedCheck):
    @staticmethod
    def evaluate(model_output: str, scenario_result: str) -> CheckResponse:
        passed = model_output.strip().lower() == scenario_result.strip().lower()
        return CheckResponse(
            score=passed,
            explanation="Exact match." if passed else "Output did not match reference.",
        )
```
Then, create or update the check using:
```python
check_sample_code = okareo.create_or_update_check(
    name="check_sample_code",
    description="Check if output matches the expected result exactly",
    check=Check(),
)
```
For the full SDK workflow (generating, uploading, and running custom checks), see Custom Checks.
Allowed Parameters
Your evaluate method may use any subset of the parameters below. The runtime passes values by name, so parameter order does not matter. Use only the parameters your check needs.
| Parameter | Type | What the runtime injects |
|---|---|---|
| `model_output` | `str` | The model output being evaluated. |
| `scenario_input` | `str` | The scenario/source text (when the run has scenario data). |
| `scenario_result` | `str` | The reference/expected output (when the run has reference data). |
| `metadata` | `dict` | Per-row metadata; see The metadata dict below. |
| `model_input` | varies | What was sent to the model (e.g. prompt or messages). |
| `audio_messages` | varies | Audio input; may be `None` if not applicable. |
| `audio_output` | varies | Audio output; may be `None` if not applicable. |
Any other parameter name will cause a validation or runtime failure.
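For example, a check that only needs the generated text and the scenario input can declare just those two parameters. A minimal sketch (the 0.5 threshold is an arbitrary illustration, not an Okareo default):

```python
from okareo.checks import CodeBasedCheck, CheckResponse


class Check(CodeBasedCheck):
    @staticmethod
    def evaluate(model_output: str, scenario_input: str) -> CheckResponse:
        # Only the parameters named in the signature are injected; order is irrelevant.
        ratio = len(model_output) / max(len(scenario_input), 1)
        return CheckResponse(
            score=ratio <= 0.5,  # arbitrary illustrative threshold
            explanation=f"Output is {ratio:.2f}x the length of the input.",
        )
```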
Return Types
Your evaluate method must return one of the following shapes:
| Shape | Description | Example |
|---|---|---|
| Single value | `bool`, `int`, or `float`. Score only, no explanation. | `return len(model_output) > 0` |
| `CheckResponse` | Named type with score, explanation, and optional metadata. Recommended. | `return CheckResponse(score=True, explanation="Non-empty output.")` |
| 2-tuple | `(explanation, score)`; order is required. | `return ("Output is non-empty.", True)` |
| 3-tuple | `(explanation, score, metadata)`; order is required. | `return ("OK", 0.95, {"detail": "high"})` |
We recommend returning CheckResponse(score=..., explanation=...) so the UI can display why a check passed or failed.
If you return a tuple, the order must be (explanation, score) — not (score, explanation). Returning them in the wrong order will cause the values to be misinterpreted at runtime.
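As a sketch, here is a simple check expressed with the 3-tuple shape; note that the explanation comes first:

```python
from okareo.checks import CodeBasedCheck


class Check(CodeBasedCheck):
    @staticmethod
    def evaluate(model_output: str) -> tuple:
        word_count = len(model_output.split())
        # Tuple order is (explanation, score, metadata) -- explanation first.
        return (
            f"Output contains {word_count} word(s).",
            word_count > 0,
            {"word_count": word_count},
        )
```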
Allowed Imports
You may only import from these modules. Imports from any other module will fail validation.
- Standard / data: `string`, `re`, `json`, `ast`, `jsonschema`, `difflib`, `math`, `statistics`, `collections`, `itertools`, `functools`, `typing`, `enum`, `datetime`
- NLP / metrics: `Levenshtein`, `rapidfuzz`, `jiwer`, `nltk`, `sacrebleu`, `rouge_score`, `spacy`, `sklearn`, `numpy`
- Okareo: `okareo` (e.g. `from okareo.checks import CodeBasedCheck, CheckResponse`)

Dangerous constructs such as `exec`, `eval`, and `open` are also disallowed.
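For instance, a custom check can lean on one of the allowed NLP libraries. The sketch below uses `rapidfuzz`; the 90-point threshold is an arbitrary illustration:

```python
from rapidfuzz import fuzz

from okareo.checks import CodeBasedCheck, CheckResponse


class Check(CodeBasedCheck):
    @staticmethod
    def evaluate(model_output: str, scenario_result: str) -> CheckResponse:
        # token_sort_ratio returns a similarity score between 0 and 100.
        similarity = fuzz.token_sort_ratio(model_output, scenario_result)
        return CheckResponse(
            score=similarity >= 90,  # arbitrary illustrative threshold
            explanation=f"Token-sort similarity: {similarity:.1f}/100.",
        )
```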
The metadata Dict
When you include a metadata parameter, it is a dict that may contain the following keys (depending on the run type):
| Key | Meaning |
|---|---|
| `latency` | Response latency (ms). |
| `input_tokens` / `output_tokens` | Token counts for the request. |
| `cost` | Cost for the request. |
| `tool_calls` | Model's tool/function calls (list/dict). |
| `tools` | Tool definitions/schema from the input. |
| `turn_taking_latency` | Voice/turn-taking latency, when applicable. |
| `words_per_minute` | Voice words-per-minute, when applicable. |
| `simulation_message_history` | Message history when simulation/trace context exists. |
| `user_<key>` | User-defined keys from datapoint metadata. |
Not all keys are present in every run. Use metadata.get(key, default) when reading.
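For example, a cost-budget check that reads metadata defensively (the $0.01 budget is an arbitrary illustration):

```python
from okareo.checks import CodeBasedCheck, CheckResponse


class Check(CodeBasedCheck):
    @staticmethod
    def evaluate(metadata: dict) -> CheckResponse:
        # Keys may be absent depending on the run type, so read with .get().
        cost = metadata.get("cost", 0.0)
        tool_calls = metadata.get("tool_calls", []) or []
        return CheckResponse(
            score=cost <= 0.01,  # arbitrary illustrative budget
            explanation=f"Request cost ${cost:.4f} with {len(tool_calls)} tool call(s).",
            metadata={"cost": cost},
        )
```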
Predefined Code Checks
Okareo provides the following predefined code-based checks. You can list all available checks in the SDK:
TypeScript:

```typescript
okareo.get_all_checks()
```

Python:

```python
okareo.get_all_checks()
```
To use any of these checks, specify them by name when running an evaluation:
TypeScript:

```typescript
const checks = ['check_name_1', 'check_name_2', ..., 'check_name_N'];

// assume that "scenario_id" is the UUID of a scenario set
const eval_results: any = await model.run_test({
    model_api_key: OPENAI_API_KEY,
    name: 'Evaluation Name',
    tags: ["Example", `Build:${UNIQUE_BUILD_ID}`],
    project_id: project_id,
    scenario_id: scenario_id,
    calculate_metrics: true,
    type: TestRunType.NL_GENERATION,
    checks: checks,
} as RunTestProps);
```

Python:

```python
checks = ['check_name_1', 'check_name_2', ..., 'check_name_N']

# assume that "scenario" is a ScenarioSetResponse object or a UUID
evaluation = model_under_test.run_test(
    name="Evaluation Name",
    scenario=scenario,
    project_id=project_id,
    test_run_type=TestRunType.NL_GENERATION,
    calculate_metrics=True,
    checks=checks,
)
```
Reference Checks
Is JSON
Name: is_json
Checks whether the model output is valid JSON. Returns True if the output parses as JSON, False otherwise.
Exact Match
Name: exact_match
Checks for an exact string or dict match between the model_output and the scenario_result.
Fuzzy Match
Name: fuzzy_match
Checks for a fuzzy string match between the model_output and the scenario_result. Uses Python's difflib.SequenceMatcher(...).real_quick_ratio() method (difflib documentation). If the ratio exceeds a threshold, the sequences are considered a fuzzy match.
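Conceptually, the comparison looks like the sketch below; the 0.8 threshold is illustrative and not Okareo's actual value:

```python
import difflib

candidate = "The quick brown fox jumps over the lazy dog."
reference = "A quick brown fox jumped over a lazy dog."

# real_quick_ratio() returns a fast upper bound on similarity in [0, 1].
ratio = difflib.SequenceMatcher(None, candidate, reference).real_quick_ratio()
is_fuzzy_match = ratio > 0.8  # illustrative threshold
```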
Natural Language Checks
Compression Ratio
Name: compression_ratio
Measures how much smaller (or larger) the generated text is compared to the scenario input. Returns len(model_output) / len(scenario_input).
Levenshtein Distance
Names: levenshtein_distance, levenshtein_distance_input
Measures the number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another.
- `levenshtein_distance` compares `model_output` against `scenario_result`.
- `levenshtein_distance_input` compares `model_output` against `scenario_input`.
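As a quick illustration of the metric itself, using the allowed `Levenshtein` package:

```python
import Levenshtein

# "kitten" -> "sitten" -> "sittin" -> "sitting": three single-character edits.
assert Levenshtein.distance("kitten", "sitting") == 3
```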
Corpus BLEU
Name: corpus_BLEU
Computes the corpus BLEU score using NLTK's corpus_bleu method, comparing the sentences in model_output (candidate) against the sentences in scenario_result (reference).
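For reference, NLTK's `corpus_bleu` expects tokenized references and hypotheses; a minimal sketch of the underlying call (Okareo's tokenization and preprocessing may differ):

```python
from nltk.translate.bleu_score import corpus_bleu

# One hypothesis with one reference; both are lists of tokens.
references = [[["the", "cat", "sat", "on", "the", "mat"]]]
hypotheses = [["the", "cat", "is", "on", "the", "mat"]]

score = corpus_bleu(references, hypotheses)  # value in [0, 1]
```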
Performance & Metadata Checks
These checks read from the metadata dict to report performance metrics. They are useful for monitoring latency, token usage, and cost.
Latency
Name: latency
Measures the response latency (in milliseconds) of the model.
Average Turn Latency
Name: avg_turn_latency
Measures the average response time (in milliseconds) of the target model across turns. Applicable to multi-turn evaluations.
Average Turn-Taking Latency
Name: avg_turn_taking_latency
Measures the average response time (in milliseconds) of a voice target when processing a TTS request.
Average Words Per Minute
Name: avg_words_per_minute
Measures the average words per minute (WPM) of the audio generated by a voice target.
Input Tokens
Name: input_tokens
Measures the number of input tokens used in a single model request.
Output Tokens
Name: output_tokens
Measures the number of output tokens used in a single model response.
Cost
Name: cost
Measures the cost of a single model request.
Total Input Tokens
Name: total_input_tokens
Measures the total number of input tokens for the target model in the scenario. When used in multi-turn, it is the total across all turns up to the current turn.
Total Output Tokens
Name: total_output_tokens
Measures the total number of output tokens for the target model in the scenario. When used in multi-turn, it is the total across all turns up to the current turn.
Total Cost
Name: total_cost
Measures the total cost for the target model in the scenario. When used in multi-turn, it is the total across all turns up to the current turn.
Total Turn Count
Name: total_turn_count
Measures the total number of turns as user-assistant pairs, ignoring system messages and assistant tool-only stubs.
Function Call Checks (Code-Based)
The following code-based checks validate the structure and content of LLM/agent function calls. For model-based function call checks that use a judge LLM to evaluate, see Model Checks -- Function Call Checks.
Function Call AST Validator
Name: function_call_ast_validator
Validates function calls using the simple AST checker from the Berkeley Function Call Leaderboard repo. The tool call in the model output is compared against the expected structure defined in the scenario result.
Function Call Conversation AST Validator
Name: function_call_conversation_ast_validator
Validates all function calls across a multi-turn conversation using the AST checker. Extends function_call_ast_validator to handle conversations with multiple function calls.
Function Call Reference Validator
Name: function_call_reference_validator
Validates function calls by comparing the structure and content of tool calls in the model output against the expected structure defined in the scenario result. Supports nested structures and regex matching for string values.
Is Function Correct
Name: is_function_correct
Checks if the generated function call name(s) in the tool call match the expected function call name(s) in the scenario result.
Are Required Params Present
Name: are_required_params_present
Checks if the generated arguments in the function call contain the required arguments specified in the scenario result.
Are All Params Expected
Name: are_all_params_expected
Checks if the generated argument names in the function call are expected based on the schema in the scenario result. Ensures that the generated arguments are not hallucinated.
Do Param Values Match
Name: do_param_values_match
Checks if each generated argument value in the function call matches the corresponding argument value in the scenario result.
Code Generation Checks
Does Code Compile
Name: does_code_compile
Checks whether the generated Python code compiles. Useful for verifying that the model output contains valid Python rather than natural language, HTML, or other non-Python content.
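The underlying idea can be illustrated with Python's built-in `compile`; this is a standalone sketch of the concept, not necessarily Okareo's implementation:

```python
def compiles(source: str) -> bool:
    # compile() raises SyntaxError for anything that is not valid Python.
    try:
        compile(source, "<model_output>", "exec")
        return True
    except SyntaxError:
        return False


assert compiles("def add(a, b):\n    return a + b")
assert not compiles("<html><body>Hello</body></html>")
```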
Code Contains All Imports
Name: contains_all_imports
Checks that all the object/function calls in the generated code have the corresponding import statements included.