
Get Started with Generations

With the availability of LLMs, generating content has become easier than ever. However, there are many dimensions along which generation can go wrong: bias, hallucination, misrepresentation, format, language, and many more.

What do you need?

You will need an environment for running Okareo. TypeScript and Python SDKs are both available. Please see the SDK sections for more on how to set up each.

Cookbook examples for this guide are available (see the note below).

Example Generation Using OpenAI

In this example, we will use OpenAI to summarize text, then score the output with lexical and semantic comparisons based on metrics specific to evaluating LLM generation.

note

Download a complete Jupyter notebook - generation_eval.ipynb
Or, start from an Okareo cookbook for TypeScript + Jest

Step 1: Setup Okareo and OpenAI

Make sure you have API keys for both Okareo and OpenAI. We suggest providing them through environment variables named OKAREO_API_KEY and OPENAI_API_KEY.
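
For example, you can read the keys from the environment in Python before running the steps below (a minimal sketch; it assumes both variables are already exported in your shell):

import os

# Read the API keys from the environment; fail fast if either is missing.
OKAREO_API_KEY = os.environ["OKAREO_API_KEY"]
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]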

Step 2: Register the Generation Model

Register a Model: Models can be shared across evaluation runs. To make this easier, all models are referred to by name. This also means model names must be unique and, once defined, cannot be modified. Don't worry, a registered model is just metadata. You can create as many as you need.


Set up the generation prompt that you will use with OpenAI.

# Simple generation prompt for use with OpenAI's GPT-3.5 Turbo model
USER_PROMPT_TEMPLATE = "{input}"

SUMMARIZATION_CONTEXT_TEMPLATE = """
You will be provided with text.
Summarize the text in 1 simple sentence.
"""

Now, register the model with the context prompt that you will use for evaluation. Registered models can be reused across multiple scenarios or behavior checks.

# Register the generation model with Okareo
from okareo import Okareo
from okareo.model_under_test import OpenAIModel

okareo = Okareo(OKAREO_API_KEY)
mut_name = "Example Generation Model"

model_under_test = okareo.register_model(
    name=mut_name,
    model=OpenAIModel(
        model_id="gpt-3.5-turbo",
        temperature=0,
        system_prompt_template=SUMMARIZATION_CONTEXT_TEMPLATE,
        user_prompt_template=USER_PROMPT_TEMPLATE,
    ),
)

Step 3: Create a Scenario to Evaluate

Create Scenario: Scenarios are either uploaded or created synthetically within Okareo. The example here demonstrates how to upload a JSONL file.


At scale, it is more common to check in JSONL files and upload them directly from CI.
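
Each line of the scenario file is a standalone JSON object. For Okareo scenario uploads, each object is expected to carry an input (what is sent to the model) and a result (the expected output); the sketch below shows the general shape rather than the exact contents of webbizz_10_articles.jsonl:

{"input": "WebBizz is an e-commerce platform that ...", "result": "WebBizz helps businesses sell online."}
{"input": "Returns are accepted within 30 days ...", "result": "WebBizz offers a 30-day return policy."}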

import os
import tempfile

webbizz_articles = os.popen('curl https://raw.githubusercontent.com/okareo-ai/okareo-python-sdk/main/examples/webbizz_10_articles.jsonl').read()
temp_dir = tempfile.gettempdir()
file_path = os.path.join(temp_dir, "webbizz_10_articles.jsonl")
with open(file_path, "w+") as file:
    lines = webbizz_articles.split('\n')
    # Use the first 3 JSON objects to make a scenario set with 3 scenarios
    for i in range(3):
        file.write(f"{lines[i]}\n")
scenario = okareo.upload_scenario_set(file_path=file_path, scenario_name="Webbizz Articles Scenario")

# make sure to clean up the temp file
os.remove(file_path)

print(f"https://app.okareo.com/project/{scenario.project_id}/scenario/{scenario.scenario_id}")

Step 4: Evaluate the Scenario

Evaluation: Okareo provides a built-in test harness for running evaluations directly in the cloud. This makes it easy to run quick or long-running tests from CI or from your local workspace.

# Evaluate the scenario and model combination and then get a link to the results on Okareo
from okareo_api_client.models.test_run_type import TestRunType

eval_name = "Example Generation"

evaluation = model_under_test.run_test(
    name=eval_name,
    scenario=scenario,
    api_key=OPENAI_API_KEY,
    test_run_type=TestRunType.NL_GENERATION,
    calculate_metrics=True,
)

print(f"See results in Okareo: {evaluation.app_link}")

Step 5: Review Results

Results: To view evaluation results, navigate to your latest evaluation within app.okareo.com or follow the link printed by the example.

Okareo automatically calculates metrics and provides an error matrix comparing expected to actual results for evaluations identified as NL_GENERATION.
