# Introduction to Checks and Metrics
Okareo evaluations span a range of input and output data types, and consequently we use different metrics to measure performance across these evaluation types. The following table summarizes the metrics used for each.
| Evaluation Type | Metrics |
|---|---|
| Generation | Checks |
| Simulation | Checks |
| Retrieval | Precision@k, Recall@k, MRR, MAP |
| Classification | Accuracy, Precision, Recall, F1 |
While the metrics under Retrieval and Classification will be familiar to data scientists and machine learning practitioners, checks are unique to Okareo. We provide more details below.
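As a quick refresher, here is a plain-Python sketch (illustrative only, not Okareo SDK code) of how Precision@k, Recall@k, and reciprocal rank are conventionally computed for a single query; MRR and MAP are the means of reciprocal rank and average precision, respectively, over all queries:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    return sum(doc in relevant for doc in top_k) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents found in the top k."""
    top_k = retrieved[:k]
    return sum(doc in relevant for doc in top_k) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document (0.0 if none is retrieved)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["doc_3", "doc_7", "doc_1"]   # ranked results for one query
relevant = {"doc_1", "doc_9"}             # ground-truth relevant documents
print(precision_at_k(retrieved, relevant, k=3))  # 0.333...
print(recall_at_k(retrieved, relevant, k=3))     # 0.5
print(reciprocal_rank(retrieved, relevant))      # 0.333...
```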
## What is a check?
In Okareo, a check is a mechanism for scoring a generative model's output. A check can be narrowly tailored to assess a particular behavior of your LLM.
With checks, you can answer behavioral questions like:
- Did the check pass? Was the check's threshold exceeded?
- In what situations did this check fail?
- Did the check change between Version A and Version B of my model?
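To make this concrete, here is a minimal sketch of running a generation evaluation with checks attached via the Python SDK. The class and parameter names follow the SDK's documented patterns (`Okareo`, `register_model`, `run_test`, and the `checks` list), but verify the exact signatures against your SDK version; the scenario data and model names here are illustrative:

```python
from okareo import Okareo
from okareo.model_under_test import OpenAIModel
from okareo_api_client.models import ScenarioSetCreate, SeedData
from okareo_api_client.models.test_run_type import TestRunType

okareo = Okareo("<OKAREO_API_TOKEN>")

# A tiny scenario set: each row pairs an input with an expected result.
scenario = okareo.create_scenario_set(
    ScenarioSetCreate(
        name="summarization-smoke-test",
        seed_data=[
            SeedData(
                input_="Okareo lets you score LLM outputs with checks.",
                result="Okareo scores LLM outputs using checks.",
            )
        ],
    )
)

# Register the model under test (an OpenAI-hosted model in this sketch).
model_under_test = okareo.register_model(
    name="summarizer-v1",
    model=OpenAIModel(
        model_id="gpt-4o-mini",
        temperature=0,
        system_prompt_template="Summarize in one sentence: {scenario_input}",
    ),
)

# Run a generation evaluation; checks are referenced by name
# (both names below are predefined checks listed later on this page).
evaluation = model_under_test.run_test(
    name="summarizer-v1-eval",
    scenario=scenario,
    api_key="<OPENAI_API_KEY>",
    test_run_type=TestRunType.NL_GENERATION,
    checks=["coherence_summary", "consistency_summary"],
)
print(evaluation.app_link)  # link to the results in the Okareo app
```

Each check then appears as a metric on the evaluation, so you can compare pass rates and scores across model versions.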
Cookbook examples that showcase Okareo checks are available here:
- Colab Notebook
- TypeScript Cookbook (clone the `okareo-cookbook` repo and download the `okareo-cli`)
## Check types
Okareo provides two types of checks:
- **Code Checks**: Python code with an `evaluate` method that runs on each evaluation row. Use these for deterministic logic, string comparisons, metadata inspection (latency, token counts, cost), or domain-specific evaluation. A code check returns a score, a pass/fail result, or a `CheckResponse` with an explanation.
- **Model Checks**: A prompt template evaluated by a judge LLM at runtime. Use these when evaluation requires understanding, reasoning, or subjective judgment (e.g. coherence, adherence, toxicity). The judge returns a score or pass/fail result.
Both types are available as predefined checks (ready to use out of the box) and as custom checks that you create yourself.
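For custom checks, both types follow a similar registration flow in the Python SDK. Below is a minimal sketch assuming the `CodeBasedCheck` and `ModelBasedCheck` base classes and the `create_or_update_check` helper; the check names and logic are hypothetical, and the exact `evaluate` signature and import paths can vary by SDK version:

```python
from okareo import Okareo
# Import paths may differ by SDK version.
from okareo.checks import CheckOutputType, CodeBasedCheck, ModelBasedCheck

okareo = Okareo("<OKAREO_API_TOKEN>")

# Custom code check: deterministic Python that runs on every evaluation row.
class UnderWordBudget(CodeBasedCheck):
    @staticmethod
    def evaluate(model_output: str, scenario_input: str, scenario_result: str) -> bool:
        # Pass if the output stays under a (hypothetical) 100-word budget.
        return len(model_output.split()) <= 100

okareo.create_or_update_check(
    name="under_word_budget",  # hypothetical check name
    description="Output stays under a 100-word budget.",
    check=UnderWordBudget(),
)

# Custom model check: a judge LLM scores each row against a prompt template.
politeness = ModelBasedCheck(
    prompt_template=(
        "On a scale of 1 to 5, rate how polite this response is. "
        "Output only the number.\n\nResponse: {model_output}"
    ),
    check_type=CheckOutputType.SCORE,  # or CheckOutputType.PASS_FAIL
)

okareo.create_or_update_check(
    name="politeness_score",  # hypothetical check name
    description="Judge-scored politeness (1-5).",
    check=politeness,
)
```

Once registered, custom checks are referenced by name in `run_test`, exactly like the predefined checks below.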
## All predefined checks at a glance
The following table lists every predefined check in Okareo, organized by category. Click through to the detail pages for full descriptions.
### Code-Based Checks
| Name | Category | Output Type | Description |
|---|---|---|---|
| is_json | Reference | bool | Checks if model output is valid JSON. |
| exact_match | Reference | bool | Exact string/dict match between model output and scenario result. |
| fuzzy_match | Reference | bool | Fuzzy string match between model output and scenario result. |
| compression_ratio | Natural Language | float | Ratio of model output length to scenario input length. |
| levenshtein_distance | Natural Language | int | Edit distance between model output and scenario result. |
| levenshtein_distance_input | Natural Language | int | Edit distance between model output and scenario input. |
| corpus_BLEU | Natural Language | float | Corpus BLEU score (model output vs scenario result). |
| latency | Performance | float | Response latency (ms). |
| avg_turn_latency | Performance | float | Average response time per turn (ms). |
| avg_turn_taking_latency | Performance | float | Average TTS response time (ms). |
| avg_words_per_minute | Performance | float | Average WPM of generated audio. |
| input_tokens | Performance | float | Input token count for a single request. |
| output_tokens | Performance | float | Output token count for a single response. |
| cost | Performance | float | Cost for a single request. |
| total_input_tokens | Performance | float | Total input tokens across turns. |
| total_output_tokens | Performance | float | Total output tokens across turns. |
| total_cost | Performance | float | Total cost across turns. |
| total_turn_count | Performance | float | Total user-assistant turn pairs. |
| function_call_ast_validator | Function Call | bool | AST-based function call validation. |
| function_call_conversation_ast_validator | Function Call | bool | Multi-turn AST function call validation. |
| function_call_reference_validator | Function Call | float | Structure/content comparison of function calls. |
| is_function_correct | Function Call | bool | Function call name matches expected. |
| are_required_params_present | Function Call | bool | Required params present in function call. |
| are_all_params_expected | Function Call | bool | No hallucinated params in function call. |
| do_param_values_match | Function Call | bool | Param values match expected values. |
| does_code_compile | Code Generation | bool | Checks if generated Python code compiles. |
| contains_all_imports | Code Generation | bool | Generated code has all required imports. |
### Model Checks
| Name | Category | Output Type | Description |
|---|---|---|---|
| behavior_adherence | Agent Behavioral | pass/fail | Adherence to instructions in scenario result. |
| task_completed | Agent Behavioral | pass/fail | Model output fulfills task from message history. |
| result_completed | Agent Behavioral | pass/fail | Model output matches expected outcome in scenario result. |
| model_refusal | Agent Behavioral | pass/fail | Whether model refuses to respond. |
| loop_guard | Agent Behavioral | pass/fail | Detects repetitive conversation patterns. |
| automated_resolution | Agent Behavioral | pass/fail | Whether agent resolved without escalation. |
| response_consistency | Agent Behavioral | pass/fail | Assistant consistency across dialog turns. |
| response_loop | Agent Behavioral | pass/fail | Detects repeated content with no new info. |
| simulation_trace_consistency | Agent Behavioral | pass/fail | Trace vs simulation dialog consistency. |
| function_call_consistency | Function Call | pass/fail | Function call consistent with model input. |
| function_call_validator | Function Call | pass/fail | Correct tool selected and sequenced. |
| function_parameter_accuracy | Function Call | pass/fail | Extracted values correct and match expectations. |
| function_result_present | Function Call | pass/fail | Tool messages have valid (non-null) outputs. |
| context_relevance | RAG | score (1-5) | Whether model input has relevant context. |
| context_consistency | RAG | pass/fail | Model output consistent with model input. |
| reference_similarity | RAG | score (1-5) | Similarity between model output and scenario result. |
| faithfulness | RAG | score (1-5) | Factual consistency with provided context. |
| fluency_summary | Summarization | score (1-5) | Grammar, spelling, word choice quality. |
| coherence_summary | Summarization | score (1-5) | Structure and organization quality. |
| consistency_summary | Summarization | score (1-5) | Factual accuracy vs source. |
| relevance_summary | Summarization | score (1-5) | Important info captured, redundancy penalized. |
| toxicity | Safety & Quality | score (1-5) | Harmful/offensive language detection. |
| fairness | Safety & Quality | score (1-5) | Bias detection across attributes. |
| empathy_score | Voice / Audio | score (1-5) | Empathetic tone in voice output. |
| is_code_functional | Code Generation | pass/fail | Generated code is functional and complete. |
| is_best_option | Classification | pass/fail | Best option selected from list. |
| reverse_qa_quality | Data Quality | score (1-5) | Quality of generated question vs context. |
| rephrase_quality | Data Quality | score (1-5) | Quality of rephrased text vs source. |