Code Checks
A Code Check uses Python code to evaluate the generated response. Your code runs on each evaluation row and returns a result — a pass/fail boolean, a numeric score, or a CheckResponse with an explanation. This is useful when you need deterministic logic, string comparisons, metadata inspection, or domain-specific evaluation that doesn't require a model check.
Okareo provides many predefined code checks out of the box. You can also write your own custom code checks.
Custom Code Checks
To create a custom Code Check:
- Create a new Python file (not in a notebook).
- Define a class named `Check` that inherits from `CodeBasedCheck`.
- Implement a static `evaluate` method.
- Include any additional code used by your check in the same file.
Here's an example:
```python
# In my_custom_check.py
from okareo.checks import CodeBasedCheck, CheckResponse


class Check(CodeBasedCheck):
    @staticmethod
    def evaluate(model_output: str, scenario_result: str) -> CheckResponse:
        passed = model_output.strip().lower() == scenario_result.strip().lower()
        return CheckResponse(
            score=passed,
            explanation="Exact match." if passed else "Output did not match reference.",
        )
```
Then, create or update the check using:
```python
check_sample_code = okareo.create_or_update_check(
    name="check_sample_code",
    description="Check if output matches the expected result exactly",
    check=Check(),
)
```
For the full SDK workflow (generating, uploading, and running custom checks), see Custom Checks.
Allowed Parameters
Your evaluate method may use any subset of the parameters below. The runtime passes values by name, so parameter order does not matter. Use only the parameters your check needs.
| Parameter | Type | What the runtime injects |
|---|---|---|
| `model_output` | `str` | The model output being evaluated. |
| `scenario_input` | `str` | The scenario/source text (when the run has scenario data). |
| `scenario_result` | `str` | The reference/expected output (when the run has reference data). |
| `metadata` | `dict` | Per-row metadata; see The metadata dict below. |
| `model_input` | varies | What was sent to the model (e.g. prompt or messages). |
| `audio_messages` | varies | Audio input; may be `None` if not applicable. |
| `audio_output` | varies | Audio output; may be `None` if not applicable. |
Any other parameter name will cause a validation or runtime failure.
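For example, a check that only needs the generated text and the scenario input can declare just those two parameters. A minimal sketch (the 0.5 threshold is an arbitrary illustration, not an Okareo default):

```python
from okareo.checks import CodeBasedCheck, CheckResponse


class Check(CodeBasedCheck):
    @staticmethod
    def evaluate(model_output: str, scenario_input: str) -> CheckResponse:
        # Only the parameters named in the signature are injected; order is irrelevant.
        ratio = len(model_output) / max(len(scenario_input), 1)
        return CheckResponse(
            score=ratio <= 0.5,  # arbitrary illustrative threshold
            explanation=f"Output is {ratio:.2f}x the length of the input.",
        )
```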
Return Types
Your evaluate method must return one of the following shapes:
| Shape | Description | Example |
|---|---|---|
| Single value | `bool`, `int`, or `float`. Score only, no explanation. | `return len(model_output) > 0` |
| `CheckResponse` | Named type with score, explanation, and optional metadata. Recommended. | `return CheckResponse(score=True, explanation="Non-empty output.")` |
| 2-tuple | `(explanation, score)`; order is required. | `return ("Output is non-empty.", True)` |
| 3-tuple | `(explanation, score, metadata)`; order is required. | `return ("OK", 0.95, {"detail": "high"})` |
We recommend returning CheckResponse(score=..., explanation=...) so the UI can display why a check passed or failed.
If you return a tuple, the order must be (explanation, score) — not (score, explanation). Returning them in the wrong order will cause the values to be misinterpreted at runtime.
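As a sketch, here is a simple check expressed with the 3-tuple shape; note that the explanation comes first:

```python
from okareo.checks import CodeBasedCheck


class Check(CodeBasedCheck):
    @staticmethod
    def evaluate(model_output: str) -> tuple:
        word_count = len(model_output.split())
        # Tuple order is (explanation, score, metadata) -- explanation first.
        return (
            f"Output contains {word_count} word(s).",
            word_count > 0,
            {"word_count": word_count},
        )
```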
Allowed Imports
You may only import from these modules. Imports from any other module will fail validation.
- Standard / data: `string`, `re`, `json`, `ast`, `jsonschema`, `difflib`, `math`, `statistics`, `collections`, `itertools`, `functools`, `typing`, `enum`, `datetime`
- NLP / metrics: `Levenshtein`, `rapidfuzz`, `jiwer`, `nltk`, `sacrebleu`, `rouge_score`, `spacy`, `sklearn`, `numpy`
- Okareo: `okareo` (e.g. `from okareo.checks import CodeBasedCheck, CheckResponse`)

Dangerous constructs such as `exec`, `eval`, and `open` are also disallowed.
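For instance, a custom check can lean on one of the allowed NLP libraries. The sketch below uses `rapidfuzz`; the 90-point threshold is an arbitrary illustration:

```python
from rapidfuzz import fuzz

from okareo.checks import CodeBasedCheck, CheckResponse


class Check(CodeBasedCheck):
    @staticmethod
    def evaluate(model_output: str, scenario_result: str) -> CheckResponse:
        # token_sort_ratio returns a similarity score between 0 and 100.
        similarity = fuzz.token_sort_ratio(model_output, scenario_result)
        return CheckResponse(
            score=similarity >= 90,  # arbitrary illustrative threshold
            explanation=f"Token-sort similarity: {similarity:.1f}/100.",
        )
```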
The metadata Dict
When you include a metadata parameter, it is a dict that may contain the following keys (depending on the run type):
| Key | Meaning |
|---|---|
| `latency` | Response latency (ms). |
| `input_tokens` / `output_tokens` | Token counts for the request. |
| `cost` | Cost for the request. |
| `tool_calls` | Model's tool/function calls (list/dict). |
| `tools` | Tool definitions/schema from the input. |
| `turn_taking_latency` | Voice/turn-taking latency, when applicable. |
| `words_per_minute` | Voice words-per-minute, when applicable. |
| `simulation_message_history` | Message history when simulation/trace context exists. |
| `user_<key>` | User-defined keys from datapoint metadata. |
Not all keys are present in every run. Use metadata.get(key, default) when reading.
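For example, a cost-budget check that reads metadata defensively (the $0.01 budget is an arbitrary illustration):

```python
from okareo.checks import CodeBasedCheck, CheckResponse


class Check(CodeBasedCheck):
    @staticmethod
    def evaluate(metadata: dict) -> CheckResponse:
        # Keys may be absent depending on the run type, so read with .get().
        cost = metadata.get("cost", 0.0)
        tool_calls = metadata.get("tool_calls", []) or []
        return CheckResponse(
            score=cost <= 0.01,  # arbitrary illustrative budget
            explanation=f"Request cost ${cost:.4f} with {len(tool_calls)} tool call(s).",
            metadata={"cost": cost},
        )
```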
Predefined Code Checks
Okareo provides the following predefined code-based checks. You can list all available checks in the SDK:
TypeScript:

```typescript
okareo.get_all_checks()
```

Python:

```python
okareo.get_all_checks()
```
To use any of these checks, specify them by name when running an evaluation:
TypeScript:

```typescript
const checks = ['check_name_1', 'check_name_2', ..., 'check_name_N'];

// assume that "scenario_id" is the UUID of a scenario set
const eval_results: any = await model.run_test({
    model_api_key: OPENAI_API_KEY,
    name: 'Evaluation Name',
    tags: ["Example", `Build:${UNIQUE_BUILD_ID}`],
    project_id: project_id,
    scenario_id: scenario_id,
    calculate_metrics: true,
    type: TestRunType.NL_GENERATION,
    checks: checks,
} as RunTestProps);
```

Python:

```python
checks = ['check_name_1', 'check_name_2', ..., 'check_name_N']

# assume that "scenario" is a ScenarioSetResponse object or a UUID
evaluation = model_under_test.run_test(
    name="Evaluation Name",
    scenario=scenario,
    project_id=project_id,
    test_run_type=TestRunType.NL_GENERATION,
    calculate_metrics=True,
    checks=checks,
)
```
Reference Checks
Is JSON
Name: is_json
Checks whether the model output is valid JSON. Returns True if the output parses as JSON, False otherwise.
Exact Match
Name: exact_match
Checks for an exact string or dict match between the model_output and the scenario_result.
Fuzzy Match
Name: fuzzy_match
Checks for a fuzzy string match between the model_output and the scenario_result. Uses Python's difflib.SequenceMatcher(...).real_quick_ratio() method (difflib documentation). If the ratio exceeds a threshold, the sequences are considered a fuzzy match.
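Conceptually, the comparison looks like the sketch below; the 0.8 threshold is illustrative and not Okareo's actual value:

```python
import difflib

candidate = "The quick brown fox jumps over the lazy dog."
reference = "A quick brown fox jumped over a lazy dog."

# real_quick_ratio() returns a fast upper bound on similarity in [0, 1].
ratio = difflib.SequenceMatcher(None, candidate, reference).real_quick_ratio()
is_fuzzy_match = ratio > 0.8  # illustrative threshold
```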
Natural Language Checks
Compression Ratio
Name: compression_ratio
Measures how much smaller (or larger) the generated text is compared to the scenario input. Returns len(model_output) / len(scenario_input).
Levenshtein Distance
Names: levenshtein_distance, levenshtein_distance_input
Measures the number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another.
- `levenshtein_distance` compares `model_output` against `scenario_result`.
- `levenshtein_distance_input` compares `model_output` against `scenario_input`.
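As a quick illustration of the metric itself, using the allowed `Levenshtein` package:

```python
import Levenshtein

# "kitten" -> "sitten" -> "sittin" -> "sitting": three single-character edits.
assert Levenshtein.distance("kitten", "sitting") == 3
```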
Corpus BLEU
Name: corpus_BLEU
Computes the corpus BLEU score using NLTK's corpus_bleu method, comparing the sentences in model_output (candidate) against the sentences in scenario_result (reference).
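For reference, NLTK's `corpus_bleu` expects tokenized references and hypotheses; a minimal sketch of the underlying call (Okareo's tokenization and preprocessing may differ):

```python
from nltk.translate.bleu_score import corpus_bleu

# One hypothesis with one reference; both are lists of tokens.
references = [[["the", "cat", "sat", "on", "the", "mat"]]]
hypotheses = [["the", "cat", "is", "on", "the", "mat"]]

score = corpus_bleu(references, hypotheses)  # value in [0, 1]
```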
Performance & Metadata Checks
These checks read from the metadata dict to report performance metrics. They are useful for monitoring latency, token usage, and cost.
Latency
Name: latency
Measures the response latency (in milliseconds) of the model.
Average Turn Latency
Name: avg_turn_latency
Measures the average response time (in milliseconds) of the target model across turns. Applicable to multi-turn evaluations.
Average Turn-Taking Latency
Name: avg_turn_taking_latency
Measures the average response time (in milliseconds) of a voice target when processing a TTS request.
Average Words Per Minute
Name: avg_words_per_minute
Measures the average words per minute (WPM) of the audio generated by a voice target.
Input Tokens
Name: input_tokens
Measures the number of input tokens used in a single model request.
Output Tokens
Name: output_tokens
Measures the number of output tokens used in a single model response.
Cost
Name: cost
Measures the cost of a single model request.
Total Input Tokens
Name: total_input_tokens
Measures the total number of input tokens for the target model in the scenario. When used in multi-turn, it is the total across all turns up to the current turn.
Total Output Tokens
Name: total_output_tokens
Measures the total number of output tokens for the target model in the scenario. When used in multi-turn, it is the total across all turns up to the current turn.
Total Cost
Name: total_cost
Measures the total cost for the target model in the scenario. When used in multi-turn, it is the total across all turns up to the current turn.
Total Turn Count
Name: total_turn_count
Measures the total number of turns as user-assistant pairs, ignoring system messages and assistant tool-only stubs.
Function Call Checks (Code-Based)
The following code-based checks validate the structure and content of LLM/agent function calls. For model-based function call checks that use a judge LLM to evaluate, see Model Checks -- Function Call Checks.
Function Call AST Validator
Name: function_call_ast_validator
Validates function calls using the simple AST checker from the Berkeley Function Call Leaderboard repo. The tool call in the model output is compared against the expected structure defined in the scenario result.
Function Call Conversation AST Validator
Name: function_call_conversation_ast_validator
Validates all function calls across a multi-turn conversation using the AST checker. Extends function_call_ast_validator to handle conversations with multiple function calls.
Function Call Reference Validator
Name: function_call_reference_validator
Validates function calls by comparing the structure and content of tool calls in the model output against the expected structure defined in the scenario result. Supports nested structures and regex matching for string values.
Is Function Correct
Name: is_function_correct
Checks if the generated function call name(s) in the tool call match the expected function call name(s) in the scenario result.
Are Required Params Present
Name: are_required_params_present
Checks if the generated arguments in the function call contain the required arguments specified in the scenario result.
Are All Params Expected
Name: are_all_params_expected
Checks if the generated argument names in the function call are expected based on the schema in the scenario result. Ensures that the generated arguments are not hallucinated.
Do Param Values Match
Name: do_param_values_match
Checks if each generated argument value in the function call matches the corresponding argument value in the scenario result.
Code Generation Checks
Does Code Compile
Name: does_code_compile
Checks whether the generated Python code compiles. Useful for verifying that the model output contains valid Python rather than natural language, HTML, or other non-Python content.
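The underlying idea can be illustrated with Python's built-in `compile`; this is a standalone sketch of the concept, not necessarily Okareo's implementation:

```python
def compiles(source: str) -> bool:
    # compile() raises SyntaxError for anything that is not valid Python.
    try:
        compile(source, "<model_output>", "exec")
        return True
    except SyntaxError:
        return False


assert compiles("def add(a, b):\n    return a + b")
assert not compiles("<html><body>Hello</body></html>")
```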
Code Contains All Imports
Name: contains_all_imports
Checks that all the object/function calls in the generated code have the corresponding import statements included.