Checks
What is a check?
In Okareo, a check is a unit of code that scores a generative model's output. The intention of a check is to assess a particular behavior of your LLM. Checks offer a number of benefits:
- While LLMs may behave stochastically, checks are deterministic, letting you objectively assess your model's performance.
- Checks can be narrowly scoped to assess specific model behaviors, letting you incorporate domain knowledge into your evaluation.
With checks, you can answer behavioral questions like:
- Did the check pass? Was the check's threshold exceeded?
- In what situations did this check fail?
- Did the check change between Version A and Version B of my model?
Cookbook examples that showcase Okareo checks are available here:
- Colab Notebook
- Typescript Cookbook - Coming Soon
Creating or Updating Checks
Okareo provides a `create_or_update_check` method to create new checks or update existing ones. This method allows you to define checks using either a `ModelBasedCheck` or a `CodeBasedCheck`.
Types of Checks
ModelBasedCheck
A `ModelBasedCheck` uses a prompt template to evaluate the data. It's particularly useful when you want to leverage an existing language model to perform the evaluation.
How to use ModelBasedCheck
Here's an example of how to create a check using `ModelBasedCheck`:
```python
check_sample_score = okareo.create_or_update_check(
    name="check_sample_score",
    description="Check sample score",
    check=ModelBasedCheck(
        prompt_template="Only output the number of words in the following text: {scenario_input} {output} {model_output}",
        check_type=CheckOutputType.SCORE,
    ),
)
```
In this example:
- `name`: A unique identifier for the check.
- `description`: A brief description of what the check does.
- `check`: An instance of `ModelBasedCheck`.
- `prompt_template`: A string that includes placeholders (input, output, generation) which will be replaced with actual values.
- `check_type`: Specifies the type of output (`SCORE` or `PASS_FAIL`).
The `prompt_template` should include at least one of the following placeholders:
- `generation`: corresponds to the model's output
- `input`: corresponds to the scenario input
- `result`: corresponds to the scenario result
The `check_type` should be one of:
- `CheckOutputType.SCORE`: The template should prompt the model for a score (single number).
- `CheckOutputType.PASS_FAIL`: The template should prompt the model for a boolean value (True/False).
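For instance, a pass/fail variant of the example above might look like the following sketch. The check name, description, and prompt wording are illustrative, and the import path is an assumption based on the `from okareo.checks import CodeBasedCheck` import shown later on this page.

```python
# Illustrative sketch; ModelBasedCheck and CheckOutputType are assumed to be
# importable from okareo.checks, alongside CodeBasedCheck.
from okareo.checks import CheckOutputType, ModelBasedCheck

check_is_brief = okareo.create_or_update_check(
    name="check_is_brief",  # hypothetical check name
    description="Pass if the model output is 50 words or fewer",
    check=ModelBasedCheck(
        prompt_template="Only output True or False: does the following text contain 50 words or fewer? {model_output}",
        check_type=CheckOutputType.PASS_FAIL,
    ),
)
```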
CodeBasedCheck
A `CodeBasedCheck` uses custom code to evaluate the data. This is useful when you need more complex logic or want to incorporate domain-specific knowledge into your check.
How to use CodeBasedCheck
To use a `CodeBasedCheck`:
- Create a new Python file (not in a notebook).
- In this file, define a class named 'Check' that inherits from CodeBasedCheck.
- Implement the `evaluate` method in your `Check` class.
- Include any additional code used by your check in the same file.
Here's an example:
```python
# In my_custom_check.py
from typing import Union

from okareo.checks import CodeBasedCheck


class Check(CodeBasedCheck):
    @staticmethod
    def evaluate(
        model_output: str, scenario_input: str, scenario_result: str
    ) -> Union[bool, int, float]:
        # Your evaluation logic here
        word_count = len(model_output.split())
        return word_count > 10  # Returns True if output has more than 10 words
```
Then, you can create or update the check using:
```python
check_sample_code = okareo.create_or_update_check(
    name="check_sample_code",
    description="Check if output has more than 10 words",
    check=Check(),
)
```
The `evaluate` method should accept `model_output`, `scenario_input`, and `scenario_result` as arguments and return a boolean, an integer, or a float.
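A check does not have to return a pass/fail boolean. As a minimal sketch (a hypothetical check, not one of Okareo's built-ins), an `evaluate` method could instead return a numeric score:

```python
# In my_score_check.py (hypothetical file name)
from typing import Union

from okareo.checks import CodeBasedCheck


class Check(CodeBasedCheck):
    @staticmethod
    def evaluate(
        model_output: str, scenario_input: str, scenario_result: str
    ) -> Union[bool, int, float]:
        # Score how much shorter the output is than the scenario input;
        # 0.0 means no reduction, values near 1.0 mean a large reduction.
        if not scenario_input:
            return 0.0
        return 1.0 - min(len(model_output) / len(scenario_input), 1.0)
```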
Okareo checks
In Okareo, we provide out-of-the-box checks to let you quickly assess your LLM's performance. In the Okareo SDK, you can list the available checks by running the following method:
- Typescript
```typescript
okareo.get_all_checks()
```
- Python
```python
okareo.get_all_checks()
```
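For example, in Python you could print the available check names. This sketch assumes each returned item exposes `name` and `id` fields; the exact shape of the response may differ by SDK version.

```python
# Print the name and ID of every check available to your account.
for check in okareo.get_all_checks():
    print(check.name, check.id)
```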
To use any of these checks, you simply specify them when running an evaluation as follows:
- Typescript
```typescript
const checks = ['check_name_1', 'check_name_2', ..., 'check_name_N'];
// assume that "scenario_id" is the UUID of an existing scenario set
const eval_results: any = await model.run_test({
  model_api_key: OPENAI_API_KEY,
  name: 'Evaluation Name',
  tags: ["Example", `Build:${UNIQUE_BUILD_ID}`],
  project_id: project_id,
  scenario_id: scenario_id,
  calculate_metrics: true,
  type: TestRunType.NL_GENERATION,
  checks: checks,
} as RunTestProps);
```
- Python
```python
checks = ['check_name_1', 'check_name_2', ..., 'check_name_N']
# assume that "scenario" is a ScenarioSetResponse object or a UUID
evaluation = model_under_test.run_test(
    name="Evaluation Name",
    scenario=scenario,
    project_id=project_id,
    test_run_type=TestRunType.NL_GENERATION,
    calculate_metrics=True,
    checks=checks,
)
```
You can track your LLM's behaviors using:
- Off-the-shelf Checks provided by Okareo
- Publicly available Checks from the community Checklist (working title)
- Custom Checks generated based on user-specific application requirements
As of now, the following out-of-the-box checks are available in Okareo:
- `conciseness`
- `uniqueness`
- `fluency` / `fluency_summary`
- `coherence` / `coherence_summary`
- `consistency` / `consistency_summary`
- `relevance` / `relevance_summary`
- `does_code_compile`
- `contains_all_imports`
- `compression_ratio`
- `levenshtein_distance` / `levenshtein_distance_input`
Automatic checks
The following checks make use of an LLM judge. The judge is provided with a system prompt describing the check in question, and each check rates the quality of the natural language text on an integer scale from 1 to 5.
Any Automatic check that includes `_summary` in its name makes use of the `scenario_input` in addition to the `model_output`.
Conciseness
Name: `conciseness`.
The conciseness check rates how concise the generated output is. If the model's output contains repeated ideas, the score will be lower.
Uniqueness
Name: `uniqueness`.
The uniqueness check rates how unique the text is compared to the other outputs in the evaluation. Consequently, this check uses all the rows in the scenario to score each row individually.
Fluency
Names: `fluency`, `fluency_summary`.
The fluency check is a measure of quality based on grammar, spelling, punctuation, word choice, and sentence structure. This check does not require a scenario input or result.
Coherence
Names: `coherence`, `coherence_summary`.
The coherence check measures the structure and organization of a model's output. A higher score indicates that the output is well-structured and organized, and a lower score indicates the opposite.
Consistency
Names: `consistency`, `consistency_summary`.
The consistency check measures factual accuracy between the model output and the scenario input. This is useful in summarization tasks, where the model's input is a target document and the model's output is a summary of that document.
Relevance
Names: `relevance`, `relevance_summary`.
The relevance check is a measure of summarization quality that rewards highly relevant information and penalizes redundant or irrelevant information.
Code Generation checks
Does Code Compile
Name: `does_code_compile`.
This check verifies whether the generated Python code compiles. It lets you tell whether the generated code contains any non-Pythonic content (e.g., natural language, HTML, etc.). Requesting the `does_code_compile` check will run the following `evaluate` method:
```python
class Check(BaseCheck):
    @staticmethod
    def evaluate(model_output: str) -> bool:
        try:
            compile(model_output, '<string>', 'exec')
            return True
        except SyntaxError:
            return False
```
Code Contains All Imports
Name: `contains_all_imports`.
This check looks at all the object/function calls in the generated code and ensures that the corresponding `import` statements are included.
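For illustration, a hypothetical generated snippet like the one below would fail this check, because `json.dumps` is called but `import json` is missing:

```python
# Hypothetical model output that fails contains_all_imports:
# json is used without a corresponding import statement.
def to_payload(data: dict) -> str:
    return json.dumps(data)
```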
Natural Language checks
Compression Ratio
Name: `compression_ratio`.
The compression ratio is a measure of how much smaller (or larger) a generated text is compared with the scenario input. In Okareo, requesting the `compression_ratio` check will invoke the following `evaluate` method:
```python
class Check(BaseCheck):
    @staticmethod
    def evaluate(model_output: str, scenario_input: str) -> float:
        return len(model_output) / len(scenario_input)
```
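As a quick worked example (with made-up strings rather than real model output):

```python
# A 100-character input summarized into 25 characters gives 25 / 100 = 0.25;
# values below 1.0 mean the output is shorter than the input.
Check.evaluate(model_output="a" * 25, scenario_input="b" * 100)  # -> 0.25
```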
Levenshtein Distance
Names: `levenshtein_distance`, `levenshtein_distance_input`.
The Levenshtein distance measures the number of edits needed to transform one string into another, where an "edit" is an addition, a deletion, or a substitution. In Okareo, requesting the `levenshtein_distance` check will invoke the following `evaluate` method:
```python
class Check(BaseCheck):
    @staticmethod
    def evaluate(model_output: str, scenario_response: str):
        # use Levenshtein distance with uniform weights
        weights = [1, 1, 1]
        return levenshtein_distance(model_output, scenario_response, weights)


def levenshtein_distance(s1, s2, weights):
    if len(s1) < len(s2):
        return levenshtein_distance(s2, s1, weights)
    if len(s2) == 0:
        return len(s1)
    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + weights[0]
            deletions = current_row[j] + weights[1]
            substitutions = previous_row[j] + (c1 != c2) * weights[2]
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]
```
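As a quick sanity check of the helper above:

```python
# Classic example: turning "kitten" into "sitting" takes 3 edits
# (substitute k->s, substitute e->i, append g).
assert levenshtein_distance("kitten", "sitting", [1, 1, 1]) == 3
```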
Similarly, the `levenshtein_distance_input` check will use the following `evaluate` method:
```python
class Check(BaseCheck):
    @staticmethod
    def evaluate(model_output: str, scenario_input: str):
        # use Levenshtein distance with uniform weights
        weights = [1, 1, 1]
        return levenshtein_distance(model_output, scenario_input, weights)
```
Custom checks
If the out-of-the-box checks do not serve your needs, then you can generate and upload your own Python-based checks.
Generating checks
To help you create your own checks, the Okareo SDK provides the `generate_check` method. You can describe the logic of your check using natural language, and an LLM will generate an `evaluate` method meeting those requirements.
For example, we can try to generate a check that looks for natural language below.
- Typescript
```typescript
const generated_check = await okareo.generate_check({
  project_id,
  name: "demo.summaryUnder256",
  description: "Pass if model_output contains at least one line of natural language.",
  output_data_type: "bool",
  requires_scenario_input: true,
  requires_scenario_result: true,
});

return await okareo.upload_check({
  project_id,
  ...generated_check
} as UploadEvaluatorProps);
```
- Python
```python
from okareo_api_client.models.evaluator_spec_request import EvaluatorSpecRequest

description = """
Return `False` if `model_output` contains at least one line of natural language.
Otherwise, return `True`.
"""

generate_request = EvaluatorSpecRequest(
    description=description,
    requires_scenario_input=False,
    requires_scenario_result=False,
    output_data_type="bool",
)
generated_test = okareo.generate_check(generate_request).generated_code
```
Please ensure that `requires_scenario_input` and `requires_scenario_result` are correctly configured for your check. For example, if your check relies on the `scenario_input`, then you should set `requires_scenario_input=True`.
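As a rough sketch of how these flags relate to the generated code, a check generated with `requires_scenario_input=True` and `requires_scenario_result=False` would be expected to produce an `evaluate` method along these lines (the exact generated signature may vary):

```python
# Hypothetical shape of a generated evaluate method that uses the scenario
# input but not the scenario result.
def evaluate(model_output: str, scenario_input: str) -> bool:
    # ...generated logic comparing model_output against scenario_input...
    return True
```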
Uploading checks
Given a generated check, the Okareo SDK provides the `upload_check` method, which allows you to run custom checks in Okareo.
- Typescript
```typescript
const upload_check: any = await okareo.upload_check({
  name: 'Example Uploaded Check',
  project_id,
  description: "Pass if the model result length is within 10% of the expected result.",
  requires_scenario_input: false,
  requires_scenario_result: true,
  output_data_type: "bool",
  file_path: "tests/example_eval.py",
  update: true
});
```
- Python
```python
import os
import tempfile

check_name = "has_no_natural_language"
temp_dir = tempfile.gettempdir()
file_path = os.path.join(temp_dir, f"{check_name}.py")

with open(file_path, "w+") as file:
    file.write(generated_test)

has_no_nl_check = okareo.upload_check(
    name=check_name,
    file_path=file_path,
    requires_scenario_input=False,
    requires_scenario_result=False,
)
```
Your `evaluate` function must be saved locally as a `.py` file, and the `file_path` should point to this `.py` file.
Evaluating with uploaded checks
Once the check has been uploaded, you can use it in a `model_under_test.run_test` call by adding the name or the ID of the check to your list of `checks`. For example:
- Typescript
```typescript
// provide a list of checks by name or ID
const eval_results: any = await model.run_test({
  model_api_key: OPENAI_API_KEY,
  name: 'Evaluation Name',
  tags: ["Example", `Build:${UNIQUE_BUILD_ID}`],
  project_id: project_id,
  scenario_id: scenario_id,
  calculate_metrics: true,
  type: TestRunType.NL_GENERATION,
  checks: [
    "check_name_1",
    "check_name_2",
    ...
  ],
} as RunTestProps);
```
- Python
```python
checks = [check_name]  # alternatively: has_no_nl_check.id
# assume that "scenario" is a ScenarioSetResponse object or a UUID
evaluation = model_under_test.run_test(
    name="Evaluation Name",
    scenario=scenario,
    test_run_type=TestRunType.NL_GENERATION,
    calculate_metrics=True,
    checks=checks,
)
```
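After the run completes, you can review the per-check results in the Okareo app. As a sketch for the Python example (assuming the returned test run object exposes an `app_link` field, as in Okareo's cookbook examples):

```python
# Print a link to the evaluation results, including per-check scores.
print(evaluation.app_link)
```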