Code Checks
A Code Check uses Python code to evaluate the generated response. This is useful when you need more complex logic or want to incorporate domain-specific knowledge into your check.
Custom Code Checks
To use a custom Code Check:
- Create a new Python file (not in a notebook).
- In this file, define a class named 'Check' that inherits from CodeBasedCheck.
- Implement the evaluate method in your Check class.
- Include any additional code used by your check in the same file.
Here's an example:
# In my_custom_check.py
from okareo.checks import CodeBasedCheck, CheckResponse


class Check(CodeBasedCheck):
    @staticmethod
    def evaluate(
        model_output: str, scenario_input: str, scenario_result: str
    ) -> CheckResponse:
        # Your evaluation logic here
        word_count = len(model_output.split())
        score = word_count > 10
        explanation = "The output contains 10 words or fewer"
        if score:
            explanation = "The output contains more than 10 words"
        return CheckResponse(score=score, explanation=explanation)
The evaluate method accepts model_output, scenario_input, and scenario_result as arguments and returns a CheckResponse object whose score is a boolean, integer, or float, along with an optional string explanation.
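The score does not have to be a boolean. As a variation, here is a minimal sketch of a check that returns a float, scoring the output by the fraction of expected keywords it contains (the keyword logic is purely illustrative, not an Okareo built-in):
# In my_keyword_check.py (hypothetical file name)
from okareo.checks import CodeBasedCheck, CheckResponse


class Check(CodeBasedCheck):
    @staticmethod
    def evaluate(
        model_output: str, scenario_input: str, scenario_result: str
    ) -> CheckResponse:
        # Illustrative only: treat scenario_result as space-separated keywords
        # and score the fraction of them that appear in the output.
        keywords = scenario_result.lower().split()
        if not keywords:
            return CheckResponse(score=0.0, explanation="No keywords to check against")
        hits = sum(1 for kw in keywords if kw in model_output.lower())
        return CheckResponse(
            score=hits / len(keywords),
            explanation=f"{hits}/{len(keywords)} expected keywords found",
        )
Registration works the same way regardless of which score type your check returns.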
Then, you can create or update the check using:
check_sample_code = okareo.create_or_update_check(
    name="check_sample_code",
    description="Check if output has more than 10 words",
    check=Check(),
)
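Once registered, the check can be referenced by name when running an evaluation. A minimal sketch, assuming a model_under_test and scenario already exist (the full set of arguments is shown in the evaluation example further below):
evaluation = model_under_test.run_test(
    name="Custom Check Evaluation",
    scenario=scenario,
    test_run_type=TestRunType.NL_GENERATION,
    calculate_metrics=True,
    checks=["check_sample_code"],
)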
Okareo Code Checks
In Okareo, we provide out-of-the-box checks to assess your LLM's performance. In the Okareo SDK, you can list the available checks by running the following method:
- Typescript
- Python
okareo.get_all_checks()
okareo.get_all_checks()
To use any of these checks, you simply specify them when running an evaluation as follows:
- Typescript
- Python
const checks = ['check_name_1', 'check_name_2', ..., 'check_name_N'];
// assume that "scenario" is a ScenarioSetResponse object or a UUID
const eval_results: any = await model.run_test({
  model_api_key: OPENAI_API_KEY,
  name: 'Evaluation Name',
  tags: ["Example", `Build:${UNIQUE_BUILD_ID}`],
  project_id: project_id,
  scenario_id: scenario_id,
  calculate_metrics: true,
  type: TestRunType.NL_GENERATION,
  checks: checks,
} as RunTestProps);
checks = ['check_name_1', 'check_name_2', ..., 'check_name_N']
# assume that "scenario" is a ScenarioSetResponse object or a UUID
evaluation = model_under_test.run_test(
    name="Evaluation Name",
    scenario=scenario,
    project_id=project_id,
    test_run_type=TestRunType.NL_GENERATION,
    calculate_metrics=True,
    checks=checks,
)
As of now, the following out-of-the-box code checks are available in Okareo:
- does_code_compile
- contains_all_imports
- compression_ratio
- levenshtein_distance / levenshtein_distance_input
- function_call_ast_validator
- exact_match
- fuzzy_match
Natural Language Checks
Compression Ratio
Name: compression_ratio.
The compression ratio is a measure of how much smaller (or larger) a generated text is compared with a scenario input. In Okareo, requesting the compression_ratio check will invoke the following evaluate method:
class Check(BaseCheck):
    @staticmethod
    def evaluate(model_output: str, scenario_input: str) -> float:
        return len(model_output) / len(scenario_input)
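As a quick worked example of how the ratio reads (the values below are made up for illustration):
# Illustrative values, not from Okareo
scenario_input = "Summarize: " + "x" * 289   # 300 characters total
model_output = "y" * 60                      # 60-character summary

print(len(model_output) / len(scenario_input))  # 0.2 -> output is 5x shorter than the input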
Levenshtein Distance
Names: levenshtein_distance, levenshtein_distance_input.
The Levenshtein distance measures the number of edits needed to transform one string into another, where an "edit" is an insertion, a deletion, or a substitution. In Okareo, requesting the levenshtein_distance check compares the model_output with the scenario response and will invoke the following evaluate method:
class Check(BaseCheck):
    @staticmethod
    def evaluate(model_output: str, scenario_response: str):
        # use Levenshtein distance with uniform weights
        weights = [1, 1, 1]
        return levenshtein_distance(model_output, scenario_response, weights)


def levenshtein_distance(s1, s2, weights):
    if len(s1) < len(s2):
        return levenshtein_distance(s2, s1, weights)
    if len(s2) == 0:
        return len(s1)
    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + weights[0]
            deletions = current_row[j] + weights[1]
            substitutions = previous_row[j] + (c1 != c2) * weights[2]
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]
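For instance, transforming "kitten" into "sitting" requires two substitutions and one insertion, so with uniform weights the distance is 3:
print(levenshtein_distance("kitten", "sitting", [1, 1, 1]))  # 3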
Similarly, the levenshtein_distance_input check compares the output against the scenario input instead and uses the following evaluate method:
class Check(BaseCheck):
    @staticmethod
    def evaluate(model_output: str, scenario_input: str):
        # use Levenshtein distance with uniform weights
        weights = [1, 1, 1]
        return levenshtein_distance(model_output, scenario_input, weights)
Reference Checks
These checks can be used on a variety of generation modalities (e.g., natural language, JSON) to determine whether the generated output matches the reference contained in the scenario_result.
Exact Match
Name: exact_match
Checks for an exact string or dict match between the model_output and the scenario_result.
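The comparison is roughly equivalent to the sketch below (illustrative only, not Okareo's actual implementation; dict outputs are assumed to arrive as JSON strings):
import json


def is_exact_match(model_output: str, scenario_result: str) -> bool:
    # If both sides parse as JSON, compare the resulting objects so that
    # key order and whitespace differences do not matter.
    try:
        return json.loads(model_output) == json.loads(scenario_result)
    except (json.JSONDecodeError, TypeError):
        pass
    # Otherwise fall back to a plain string comparison.
    return model_output == scenario_result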
Fuzzy Match
Name: fuzzy_match
Checks for a fuzzy string match between the model_output and the scenario_result. The check uses the Python built-in difflib module's SequenceMatcher(...).real_quick_ratio() method (see the difflib documentation). If the ratio exceeds a threshold, the sequences are considered a fuzzy match.
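The underlying comparison looks roughly like this (a sketch; the 0.8 threshold is an assumed value, not necessarily the one Okareo uses):
from difflib import SequenceMatcher


def is_fuzzy_match(model_output: str, scenario_result: str, threshold: float = 0.8) -> bool:
    # real_quick_ratio() returns a fast upper bound on the similarity ratio
    ratio = SequenceMatcher(None, model_output, scenario_result).real_quick_ratio()
    return ratio > threshold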
Function Call Checks
The following checks are used to validate LLMs/agents that generate function calls.
Function Call AST Validator
Name: function_call_ast_validator.
Validates function calls using the simple AST checker from the Berkeley Function Call Leaderboard repo. The tool call in the model output is compared against the expected structure defined in the scenario result.
Function Call Reference Validator
Name: function_call_reference_validator
Validates function calls by comparing the structure and content of tool calls in the model output against the expected structure defined in the scenario result. It ensures that all required parameters are present and match any specified patterns, supporting nested structures and regex matching for string values.
Do Param Values Match
Name: do_param_values_match
Checks whether the parameter values in the generated tool call match the expected values defined in the scenario result, supporting nested structures and regex matching for string values.
Are All Params Expected
Name: are_all_params_expected
Checks whether the generated argument names in the function call are expected based on the schema in the scenario_result, ensuring that the generated arguments are not hallucinated.
Are Required Params Present
Name: are_required_params_present
Checks if the generated arguments in the function call contain the required arguments in the scenario_result.
Is Function Correct
Name: is_function_correct
Checks whether the generated function call name(s) in the tool_call match the expected function call name(s) in the scenario_result.
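To make the last three checks concrete, here is an illustrative sketch of the comparisons they perform on a single tool call. The dictionary shapes below are assumptions (an OpenAI-style function call and a simple expected schema), not Okareo's internal format:
# Assumed shapes, for illustration only
tool_call = {
    "name": "get_weather",
    "arguments": {"city": "Berlin", "unit": "celsius"},
}
expected = {
    "name": "get_weather",
    "required": ["city"],
    "allowed": ["city", "unit", "date"],
}

# is_function_correct: the generated function name matches the expected one
function_correct = tool_call["name"] == expected["name"]

# are_required_params_present: every required argument was generated
required_present = all(p in tool_call["arguments"] for p in expected["required"])

# are_all_params_expected: no generated argument falls outside the schema
no_hallucinated_params = all(p in expected["allowed"] for p in tool_call["arguments"])

print(function_correct, required_present, no_hallucinated_params)  # True True True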
Code Generation Checks
Does Code Compile
Name: does_code_compile.
This check verifies that the generated Python code compiles, which lets you detect non-Pythonic content in the output (e.g., natural language, HTML, etc.). Requesting the does_code_compile check will run the following evaluate method:
class Check(BaseCheck):
    @staticmethod
    def evaluate(model_output: str) -> bool:
        try:
            compile(model_output, '<string>', 'exec')
            return True
        except SyntaxError:
            return False
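For example, with illustrative inputs:
print(Check.evaluate("def add(a, b):\n    return a + b"))      # True
print(Check.evaluate("Here is your function: def add(a, b)"))  # False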
Code Contains All Imports
Name: contains_all_imports.
This check looks at all the object/function calls in the generated code and ensures that the corresponding import statements are included.
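Okareo's implementation is not shown in the docs, but the general idea can be sketched with Python's ast module: collect the imported names, then confirm that every module-style prefix used in the code (e.g., np in np.array) is covered by an import or a local definition. This is a simplified illustration, not the actual check:
import ast


def references_have_imports(code: str) -> bool:
    tree = ast.parse(code)
    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imported.update(alias.asname or alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            imported.update(alias.asname or alias.name for alias in node.names)
    # Names used as module prefixes, e.g. "np" in np.array(...)
    used_prefixes = {
        node.value.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name)
    }
    # Ignore prefixes that are defined locally via assignment
    defined = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)}
    return used_prefixes <= imported | defined


print(references_have_imports("import numpy as np\nx = np.array([1, 2])"))  # True
print(references_have_imports("x = np.array([1, 2])"))                      # False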