# Introduction to Checks and Metrics
Okareo evaluations span a range of input and output data types, so we use different metrics depending on the evaluation type. The following table summarizes the metrics used for each.
| Evaluation Type | Metrics |
|---|---|
| Generation | Checks |
| Simulation | Checks |
| Retrieval | Precision@k, Recall@k, MRR, MAP |
| Classification | Accuracy, Precision, Recall, F1 |
While the metrics under Retrieval and Classification will be familiar to data scientists and machine learning practitioners, checks are unique to Okareo. We provide more details below.
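For reference, the retrieval metrics above follow their standard definitions. The sketch below computes precision@k, recall@k, and reciprocal rank for a single query; the helper names are illustrative, not part of the Okareo SDK:

```python
from typing import List, Set

def precision_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved items that are relevant."""
    top_k = ranked_ids[:k]
    return sum(1 for doc in top_k if doc in relevant_ids) / k

def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top k."""
    top_k = ranked_ids[:k]
    return sum(1 for doc in top_k if doc in relevant_ids) / len(relevant_ids)

def reciprocal_rank(ranked_ids: List[str], relevant_ids: Set[str]) -> float:
    """1/rank of the first relevant item; 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0

# MRR is the mean of reciprocal_rank over all queries in an evaluation.
```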
## What is a check?
In Okareo, a check is a mechanism for scoring a generative model's output. A check can be narrowly tailored to assess a particular behavior of your LLM.
With checks, you can answer behavioral questions like:
- Did the check pass? Was the check's threshold exceeded?
- In what situations did this check fail?
- Did the check's result change between Version A and Version B of my model?
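To make the pass/threshold semantics concrete, here is a minimal sketch of a check's scoring logic: a score function plus a threshold that determines pass or fail. The scoring function and cutoff below are illustrative assumptions, not one of Okareo's predefined checks:

```python
def conciseness_score(model_output: str) -> float:
    """Toy score: 1.0 for short outputs, decaying toward 0 as length grows."""
    return min(1.0, 200 / max(len(model_output), 1))

THRESHOLD = 0.5  # illustrative pass/fail cutoff

def check_passes(model_output: str) -> bool:
    """The check passes when its score meets or exceeds the threshold."""
    return conciseness_score(model_output) >= THRESHOLD
```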
Cookbook examples that showcase Okareo checks are available here:
- Colab Notebook
- TypeScript Cookbook (clone the `okareo-cookbook` repo and download the `okareo-cli`)
Okareo provides predefined checks that let you bootstrap your generative evaluations and simulations. Additionally, you can upload custom checks to Okareo to tailor your evaluations to your specific needs.
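As a sketch of the custom-check upload flow, the snippet below registers a code-based check through the Python SDK. The `CodeBasedCheck` subclass pattern, the `evaluate` signature, and the `create_or_update_check` call reflect our reading of the SDK and should be verified against the current Okareo reference:

```python
# A hedged sketch, assuming the Python SDK's code-based check pattern.
# CodeBasedCheck and create_or_update_check are assumed names; confirm
# them against the current SDK before relying on this.
import os
from okareo import Okareo
from okareo.checks import CodeBasedCheck  # assumed import path

class ContainsNoApology(CodeBasedCheck):
    @staticmethod
    def evaluate(model_output: str, scenario_input: str, scenario_result: str) -> bool:
        # Pass when the output avoids apology boilerplate.
        lowered = model_output.lower()
        return not any(p in lowered for p in ("i'm sorry", "i apologize"))

okareo = Okareo(os.environ["OKAREO_API_KEY"])
okareo.create_or_update_check(
    name="contains_no_apology",
    description="Fails when the output contains apology boilerplate.",
    check=ContainsNoApology(),
)
```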