Code Checks
A Code Check uses Python code to evaluate the generated response. This is useful when you need more complex logic or want to incorporate domain-specific knowledge into your check.
Custom Code Checks
To use a custom Code Check:
- Create a new Python file (not in a notebook).
- In this file, define a class named 'Check' that inherits from CodeBasedCheck.
- Implement the evaluate method in your Check class.
- Include any additional code used by your check in the same file.
Here's an example:
# In my_custom_check.py
from typing import Union

from okareo.checks import CodeBasedCheck


class Check(CodeBasedCheck):
    @staticmethod
    def evaluate(
        model_output: str, scenario_input: str, scenario_result: str
    ) -> Union[bool, int, float]:
        # Your evaluation logic here
        word_count = len(model_output.split())
        return word_count > 10  # Returns True if output has more than 10 words
The evaluate method should accept model_output, scenario_input, and scenario_result as arguments and return a boolean, an integer, or a float.
Then, you can create or update the check using:
check_sample_code = okareo.create_or_update_check(
    name="check_sample_code",
    description="Check if output has more than 10 words",
    check=Check(),
)
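Once registered, the check can be referenced by name when running an evaluation. A minimal sketch, assuming a model_under_test and scenario set up as in the evaluation example later on this page:
evaluation = model_under_test.run_test(
    name="Custom Check Evaluation",
    scenario=scenario,
    test_run_type=TestRunType.NL_GENERATION,
    calculate_metrics=True,
    checks=["check_sample_code"],  # reference the custom check by its registered name
)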
Okareo Code Checks
In Okareo, we provide out-of-the-box checks to assess your LLM's performance. In the Okareo SDK, you can list the available checks by running the following method:
- Typescript
- Python
okareo.get_all_checks()
okareo.get_all_checks()
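For example, with the Python SDK you can print the names of all available checks. A minimal sketch, assuming an instantiated client and that each returned check object exposes a name attribute:
from okareo import Okareo

okareo = Okareo("YOUR_OKAREO_API_KEY")

# List every check registered in Okareo and print its name
# (assumes each returned check object exposes a `name` attribute)
for check in okareo.get_all_checks():
    print(check.name)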
To use any of these checks, you simply specify them when running an evaluation as follows:
- Typescript
- Python
const checks = ['check_name_1', 'check_name_2', ..., 'check_name_N'];
// assume that "scenario" is a ScenarioSetResponse object or a UUID
const eval_results: any = await model.run_test({
    model_api_key: OPENAI_API_KEY,
    name: 'Evaluation Name',
    tags: ["Example", `Build:${UNIQUE_BUILD_ID}`],
    project_id: project_id,
    scenario_id: scenario_id,
    calculate_metrics: true,
    type: TestRunType.NL_GENERATION,
    checks: checks,
} as RunTestProps);
checks = ['check_name_1', 'check_name_2', ..., 'check_name_N']
# assume that "scenario" is a ScenarioSetResponse object or a UUID
evaluation = model_under_test.run_test(
    name="Evaluation Name",
    scenario=scenario,
    project_id=project_id,
    test_run_type=TestRunType.NL_GENERATION,
    calculate_metrics=True,
    checks=checks,
)
As of now, the following out-of-the-box code checks are available in Okareo:
- does_code_compile
- contains_all_imports
- compression_ratio
- levenshtein_distance / levenshtein_distance_input
- function_call_ast_validator
Natural Language checks
Compression Ratio
Name: compression_ratio.
The compression ratio is a measure of how much smaller (or larger) a generated text is compared with the scenario input. In Okareo, requesting the compression_ratio check will invoke the following evaluate method:
class Check(BaseCheck):
    @staticmethod
    def evaluate(model_output: str, scenario_input: str) -> float:
        return len(model_output) / len(scenario_input)
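As a quick standalone illustration of that logic (outside the SDK), a short output produced from a longer input yields a ratio well below 1.0:
# Standalone illustration of the compression_ratio computation
scenario_input = "The quick brown fox jumps over the lazy dog. " * 4
model_output = "A fox jumps over a dog."

ratio = len(model_output) / len(scenario_input)
print(round(ratio, 3))  # well below 1.0, since the output is much shorter than the input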
Levenshtein Distance
Names: levenshtein_distance, levenshtein_distance_input.
The Levenshtein distance measures the number of edits needed to transform one string into another, where an "edit" can be an insertion, a deletion, or a substitution. In Okareo, requesting the levenshtein_distance check will invoke the following evaluate method:
class Check(BaseCheck):
    @staticmethod
    def evaluate(model_output: str, scenario_response: str):
        # use Levenshtein distance with uniform weights
        weights = [1, 1, 1]
        return levenshtein_distance(model_output, scenario_response, weights)

def levenshtein_distance(s1, s2, weights):
    if len(s1) < len(s2):
        return levenshtein_distance(s2, s1, weights)
    if len(s2) == 0:
        return len(s1)
    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + weights[0]
            deletions = current_row[j] + weights[1]
            substitutions = previous_row[j] + (c1 != c2) * weights[2]
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]
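For instance, the helper above returns the classic distance of 3 for "kitten" vs. "sitting" (substitute k→s, substitute e→i, insert g):
# Example usage of the helper defined above, with uniform weights
print(levenshtein_distance("kitten", "sitting", [1, 1, 1]))  # 3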
Similarly, requesting the levenshtein_distance_input check will invoke the following evaluate method:
class Check(BaseCheck):
    @staticmethod
    def evaluate(model_output: str, scenario_input: str):
        # use Levenshtein distance with uniform weights
        weights = [1, 1, 1]
        return levenshtein_distance(model_output, scenario_input, weights)
Function Call Checks
The following checks are used to validate LLMs/agents that generate function calls.
Function Call AST Validator
Name: function_call_ast_validator.
Validates function calls using the simple AST checker from the Berkeley Function Call Leaderboard repo. The tool call in the model output is compared against the expected structure defined in the scenario result.
Function Call Reference Validator
Name: function_call_reference_validator
Validates function calls by comparing the structure and content of tool calls in the model output against the expected structure defined in the scenario result. It ensures that all required parameters are present and match any specified patterns, supporting nested structures and regex matching for string values.
Do Param Values Match
Name: do_param_values_match
Checks whether the parameter values in the generated function call match the expected values defined in the scenario_result, supporting nested structures and regex matching for string values.
Are All Params Expected
Name: are_all_params_expected
Checks if the generated argument names in the function call are expected based on the schema in the scenario_result, ensuring that the generated arguments are not hallucinated.
Are Required Params Present
Name: are_required_params_present
Checks if the generated arguments in the function call contain the required arguments in the scenario_result.
Is Function Correct
Name: is_function_correct
Checks if the generated function call name(s) in the tool_call matches the expected function call name(s) in the scenario_result.
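To make the intent of these function call checks concrete, here is a purely illustrative sketch of the kind of comparison they perform between a generated tool call and the expected structure in the scenario result. The dictionaries and comparison logic below are hypothetical and not Okareo's actual schema or implementation:
# Hypothetical tool call produced by the model and expected structure from the scenario
model_tool_call = {
    "name": "get_weather",
    "arguments": {"city": "Paris", "unit": "celsius"},
}
expected = {
    "name": "get_weather",
    "arguments": {"city": "Paris"},  # only "city" is required in this example
}

# is_function_correct-style comparison: the function names must match
name_ok = model_tool_call["name"] == expected["name"]

# are_required_params_present / do_param_values_match-style comparison:
# every expected argument must be present with a matching value
params_ok = all(
    model_tool_call["arguments"].get(key) == value
    for key, value in expected["arguments"].items()
)

print(name_ok and params_ok)  # True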
Code Generation checks
Does Code Compile
Name: does_code_compile.
This check verifies whether the generated Python code compiles, which lets you tell whether the generated code contains any non-Pythonic content (e.g., natural language, HTML, etc.). Requesting the does_code_compile check will run the following evaluate method:
class Check(BaseCheck):
    @staticmethod
    def evaluate(model_output: str) -> bool:
        try:
            compile(model_output, '<string>', 'exec')
            return True
        except SyntaxError:
            return False
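For example, valid Python passes the check while conversational text does not:
# The compile() built-in accepts valid Python and raises SyntaxError on anything else
print(Check.evaluate("def add(a, b):\n    return a + b"))       # True
print(Check.evaluate("Sure! Here is the code you asked for:"))  # False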
Code Contains All Imports
Name: contains_all_imports.
This check looks at all the object/function calls in the generated code and ensures that the corresponding import statements are included.
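Okareo's implementation is not shown here, but the idea can be sketched with Python's ast module: collect the names a snippet uses and flag any that are neither imported, defined, nor built in. The missing_imports helper below is hypothetical and intentionally simplified (it ignores function parameters, comprehension scopes, and other binding forms):
import ast
import builtins

def missing_imports(code: str) -> set:
    """Return names that are used but neither imported, defined, nor built in."""
    tree = ast.parse(code)
    imported, defined, used = set(), set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imported |= {a.asname or a.name.split(".")[0] for a in node.names}
        elif isinstance(node, ast.ImportFrom):
            imported |= {a.asname or a.name for a in node.names}
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            else:
                used.add(node.id)
    return used - imported - defined - set(dir(builtins))

print(missing_imports("df = pd.DataFrame()"))                       # {'pd'} -> missing import
print(missing_imports("import pandas as pd\ndf = pd.DataFrame()"))  # set()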