Introduction to Checks and Metrics

Okareo evaluations span a range of input and output data types, and consequently, we use different metrics to measure the performance across these evaluation types. The following table summarizes the metrics used for each evaluation type.

| Evaluation Type | Metrics |
| --- | --- |
| Generation | Checks |
| Simulation | Checks |
| Retrieval | Precision@k, Recall@k, MRR, MAP |
| Classification | Accuracy, Precision, Recall, F1 |
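
For readers less familiar with the retrieval metrics listed above, the following is a minimal sketch of their standard textbook definitions for a single query. It is not Okareo's implementation; the function names and the sample document IDs are hypothetical. MRR and MAP are the means of the per-query reciprocal rank and average precision over all queries.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant (divides by k)."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision@rank taken at each rank where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Hypothetical single-query example: two relevant documents, four results.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d3", "d2"}

p_at_2 = precision_at_k(ranked, relevant, k=2)   # 0.5
r_at_2 = recall_at_k(ranked, relevant, k=2)      # 0.5
rr = reciprocal_rank(ranked, relevant)           # 1.0
ap = average_precision(ranked, relevant)         # 0.75
```

Some libraries divide Precision@k by the number of results actually returned when fewer than k are available; the sketch above always divides by k.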

The metrics under Retrieval and Classification will be familiar to data scientists and machine learning practitioners. Checks, however, are unique to Okareo and are described in more detail below.

What is a check?

In Okareo, a check is a mechanism for scoring a generative model's output. A check can be narrowly tailored to assess a particular behavior of your LLM.

With checks, you can answer behavioral questions like:

  • Did the check pass? Was the check's threshold exceeded?
  • In what situations did this check fail?
  • Did the check change between Version A and Version B of my model?
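
The questions above can be sketched in a few lines of plain Python. This is an illustration only, not the Okareo SDK: the `RowResult` type, the row IDs, the scores, and the threshold are all hypothetical, and it assumes a row passes when its score is at or above the threshold.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowResult:
    """Hypothetical per-row result of one check."""
    row_id: str
    score: float


def pass_rate(results: List[RowResult], threshold: float) -> float:
    """Share of rows whose score meets the threshold (assumed: score >= threshold)."""
    if not results:
        return 0.0
    return sum(r.score >= threshold for r in results) / len(results)


def failing_rows(results: List[RowResult], threshold: float) -> List[str]:
    """IDs of rows that fell below the threshold, for inspecting failure cases."""
    return [r.row_id for r in results if r.score < threshold]


# Made-up scores for the same scenario rows under two model versions.
version_a = [RowResult("row-1", 4.0), RowResult("row-2", 2.0),
             RowResult("row-3", 5.0), RowResult("row-4", 3.0)]
version_b = [RowResult("row-1", 4.0), RowResult("row-2", 4.0),
             RowResult("row-3", 5.0), RowResult("row-4", 3.0)]
THRESHOLD = 3.0

rate_a = pass_rate(version_a, THRESHOLD)
rate_b = pass_rate(version_b, THRESHOLD)
delta = rate_b - rate_a                      # change from Version A to Version B
failed_in_a = failing_rows(version_a, THRESHOLD)
```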

Cookbook examples that showcase Okareo checks are available here:

Check types

Okareo provides two types of checks:

  • Code Checks: Python code with an evaluate method that runs on each evaluation row. Use these for deterministic logic, string comparisons, metadata inspection (latency, token counts, cost), or domain-specific evaluation. A code check returns a score, a pass/fail result, or a CheckResponse with an explanation.

  • Model Checks: A prompt template evaluated by a judge LLM at runtime. Use these when evaluation requires understanding, reasoning, or subjective judgment (e.g., coherence, adherence, toxicity). The judge returns a score or a pass/fail result.

Both types are available as predefined checks (ready to use out of the box) and as custom checks that you create yourself.
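
To make the shape of a custom code check concrete, here is a minimal, self-contained sketch. It does not import the Okareo SDK: the `CheckResponse` dataclass below is a hypothetical stand-in for the SDK type of the same name, and the `evaluate` signature (its parameter names and which arguments it receives) is an assumption, so consult the SDK reference for the real base class and signature.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class CheckResponse:
    """Hypothetical stand-in: a result value plus a human-readable explanation."""
    score: Union[bool, int, float]
    explanation: str


class Check:
    """Sketch of a custom code check; evaluate() runs once per evaluation row."""

    @staticmethod
    def evaluate(model_output: str, scenario_result: str) -> CheckResponse:
        # Deterministic logic: pass if the expected answer appears in the
        # model output, ignoring case and surrounding whitespace.
        expected = scenario_result.strip().lower()
        passed = expected in model_output.lower()
        if passed:
            explanation = f"Output contains the expected answer '{expected}'."
        else:
            explanation = f"Output does not contain the expected answer '{expected}'."
        return CheckResponse(score=passed, explanation=explanation)


# Usage on a single hypothetical row.
response = Check.evaluate(
    model_output="The capital of France is Paris.",
    scenario_result=" paris ",
)
```

Returning an explanation alongside the result makes it easier to see why a particular row failed when reviewing an evaluation.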

All predefined checks at a glance

The following tables list every predefined check in Okareo, grouped by check type and labeled by category. See the detail pages for full descriptions.

Code-Based Checks

| Name | Category | Output Type | Description |
| --- | --- | --- | --- |
| is_json | Reference | bool | Checks if model output is valid JSON. |
| exact_match | Reference | bool | Exact string/dict match between model output and scenario result. |
| fuzzy_match | Reference | bool | Fuzzy string match between model output and scenario result. |
| compression_ratio | Natural Language | float | Ratio of model output length to scenario input length. |
| levenshtein_distance | Natural Language | int | Edit distance between model output and scenario result. |
| levenshtein_distance_input | Natural Language | int | Edit distance between model output and scenario input. |
| corpus_BLEU | Natural Language | float | Corpus BLEU score (model output vs scenario result). |
| latency | Performance | float | Response latency (ms). |
| avg_turn_latency | Performance | float | Average response time per turn (ms). |
| avg_turn_taking_latency | Performance | float | Average TTS response time (ms). |
| avg_words_per_minute | Performance | float | Average WPM of generated audio. |
| input_tokens | Performance | float | Input token count for a single request. |
| output_tokens | Performance | float | Output token count for a single response. |
| cost | Performance | float | Cost for a single request. |
| total_input_tokens | Performance | float | Total input tokens across turns. |
| total_output_tokens | Performance | float | Total output tokens across turns. |
| total_cost | Performance | float | Total cost across turns. |
| total_turn_count | Performance | float | Total user-assistant turn pairs. |
| function_call_ast_validator | Function Call | bool | AST-based function call validation. |
| function_call_conversation_ast_validator | Function Call | bool | Multi-turn AST function call validation. |
| function_call_reference_validator | Function Call | float | Structure/content comparison of function calls. |
| is_function_correct | Function Call | bool | Function call name matches expected. |
| are_required_params_present | Function Call | bool | Required params present in function call. |
| are_all_params_expected | Function Call | bool | No hallucinated params in function call. |
| do_param_values_match | Function Call | bool | Param values match expected values. |
| does_code_compile | Code Generation | bool | Checks if generated Python code compiles. |
| contains_all_imports | Code Generation | bool | Generated code has all required imports. |
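
To give a feel for what a few of these checks measure, the sketch below reimplements four of them in plain Python using only the standard library. These are illustrations based on the one-line descriptions above, not Okareo's implementations. In particular, measuring length in characters for the compression ratio is an assumption.

```python
import json


def is_json(model_output: str) -> bool:
    """True if the output parses as JSON."""
    try:
        json.loads(model_output)
        return True
    except (TypeError, ValueError):  # JSONDecodeError is a ValueError
        return False


def compression_ratio(model_output: str, scenario_input: str) -> float:
    """Output length divided by input length (assumed: character counts)."""
    if not scenario_input:
        return 0.0
    return len(model_output) / len(scenario_input)


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def does_code_compile(source: str) -> bool:
    """True if the text compiles as Python source (the code is not executed)."""
    try:
        compile(source, "<generated>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False
```

For example, `levenshtein_distance("kitten", "sitting")` returns 3, and `does_code_compile("def f(:")` returns `False`.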

Model Checks

| Name | Category | Output Type | Description |
| --- | --- | --- | --- |
| behavior_adherence | Agent Behavioral | pass/fail | Adherence to instructions in scenario result. |
| task_completed | Agent Behavioral | pass/fail | Model output fulfills task from message history. |
| result_completed | Agent Behavioral | pass/fail | Model output matches expected outcome in scenario result. |
| model_refusal | Agent Behavioral | pass/fail | Whether model refuses to respond. |
| loop_guard | Agent Behavioral | pass/fail | Detects repetitive conversation patterns. |
| automated_resolution | Agent Behavioral | pass/fail | Whether agent resolved without escalation. |
| response_consistency | Agent Behavioral | pass/fail | Assistant consistency across dialog turns. |
| response_loop | Agent Behavioral | pass/fail | Detects repeated content with no new info. |
| simulation_trace_consistency | Agent Behavioral | pass/fail | Trace vs simulation dialog consistency. |
| function_call_consistency | Function Call | pass/fail | Function call consistent with model input. |
| function_call_validator | Function Call | pass/fail | Correct tool selected and sequenced. |
| function_parameter_accuracy | Function Call | pass/fail | Extracted values correct and match expectations. |
| function_result_present | Function Call | pass/fail | Tool messages have valid (non-null) outputs. |
| context_relevance | RAG | score (1-5) | Whether model input has relevant context. |
| context_consistency | RAG | pass/fail | Model output consistent with model input. |
| reference_similarity | RAG | score (1-5) | Similarity between model output and scenario result. |
| faithfulness | RAG | score (1-5) | Factual consistency with provided context. |
| fluency_summary | Summarization | score (1-5) | Grammar, spelling, word choice quality. |
| coherence_summary | Summarization | score (1-5) | Structure and organization quality. |
| consistency_summary | Summarization | score (1-5) | Factual accuracy vs source. |
| relevance_summary | Summarization | score (1-5) | Important info captured, redundancy penalized. |
| toxicity | Safety & Quality | score (1-5) | Harmful/offensive language detection. |
| fairness | Safety & Quality | score (1-5) | Bias detection across attributes. |
| empathy_score | Voice / Audio | score (1-5) | Empathetic tone in voice output. |
| is_code_functional | Code Generation | pass/fail | Generated code is functional and complete. |
| is_best_option | Classification | pass/fail | Best option selected from list. |
| reverse_qa_quality | Data Quality | score (1-5) | Quality of generated question vs context. |
| rephrase_quality | Data Quality | score (1-5) | Quality of rephrased text vs source. |
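
As a rough illustration of how a score (1-5) model check works, the sketch below fills a prompt template with row data, sends it to a judge, and parses the score from the reply. Everything here is hypothetical: the template wording, the `{scenario_input}` and `{model_output}` placeholder names, and the `run_model_check` helper are assumptions, and the judge is a stub function standing in for a real LLM call.

```python
import re
from typing import Callable

# Hypothetical template for a coherence-style check; placeholder names are assumed.
PROMPT_TEMPLATE = """You are grading a summary for coherence.

Source text:
{scenario_input}

Summary:
{model_output}

Rate the summary's structure and organization from 1 (worst) to 5 (best).
Answer with a single integer."""


def run_model_check(judge: Callable[[str], str],
                    scenario_input: str,
                    model_output: str) -> int:
    """Render the template, ask the judge, and extract a 1-5 score from its reply."""
    prompt = PROMPT_TEMPLATE.format(
        scenario_input=scenario_input,
        model_output=model_output,
    )
    reply = judge(prompt)
    match = re.search(r"\b[1-5]\b", reply)
    if match is None:
        raise ValueError(f"Judge reply contains no 1-5 score: {reply!r}")
    return int(match.group())


def stub_judge(prompt: str) -> str:
    """Stand-in for a judge LLM; always answers with the same score."""
    return "Score: 4"


score = run_model_check(
    stub_judge,
    scenario_input="A long article about solar panel efficiency.",
    model_output="Solar panels have become more efficient over time.",
)
```

A pass/fail model check follows the same pattern, with the template asking for a binary verdict instead of a numeric score.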