Generating Synthetic Scenarios

In addition to driving evaluations, Okareo scenarios can also be used to generate synthetic data. Generated scenarios can be a powerful tool to improve your model evaluation pipeline by allowing you to:

Create new test cases automatically
Ensure robustness to input perturbations/human error

Seed scenarios

To generate synthetic data in Okareo, you will need to begin with a Seed scenario, so called because it serves as the "seed" for Generated scenarios. Any scenario that has been uploaded to or created in Okareo can serve as the Seed for a Generated scenario.

As of now, there are three paths to creating/designating a Seed scenario:

An uploaded file (.jsonl)
A static definition
An existing scenario (Seed or Generated)

Creating seed scenarios

To create a seed scenario with a .jsonl file, you can use the following:

Python

seed_scenario = okareo.upload_scenario_set(
    file_path='./path/to/your/file.jsonl',
    scenario_name="your_scenario_name"
)

Typescript

const data: any = await okareo.upload_scenario_set({
    file_path: "./path/to/your/file.jsonl",
    scenario_name: "your_scenario_name",
    project_id: project_id
});
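
A .jsonl file holds one JSON object per line. As a rough sketch, such a file can be produced with Python's standard library. The `input` and `result` field names mirror the JSONL comment in the chaining example on this page; the row values, the `write_jsonl` and `read_jsonl` helpers, and the temporary path are illustrative assumptions and not part of the Okareo SDK.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical seed rows; field names follow the JSONL comment in the
# chaining example on this page.
rows = [
    {"input": "What type of products does WebBizz offer?",
     "result": ["75eaa363-dfcc-499f-b2af-1407b43cb133"]},
    {"input": "input_2", "result": ["UUID_2"]},
]


def write_jsonl(path: Path, records: list[dict]) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path: Path) -> list[dict]:
    """Parse the file back into dictionaries, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Write to a throwaway directory and read the rows back as a sanity check.
seed_path = Path(tempfile.mkdtemp()) / "seed.jsonl"
write_jsonl(seed_path, rows)
loaded = read_jsonl(seed_path)
```

Reading the file back before uploading is a cheap way to catch a malformed line early.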

To create a seed scenario via a static definition, you can use the following:

Python

from okareo_api_client.models import ScenarioSetCreate, ScenarioType, SeedData

# list of statically defined seed data
seed_data=[
    SeedData(input_="input1", result="result1"),
    SeedData(input_="input2", result="result2"),
    SeedData(input_="input3", result="result3")
]

# request for scenario set creation 
scenario_set_create = ScenarioSetCreate(
    name="your_static_scenario_name",
    generation_type=ScenarioType.SEED,
    seed_data=seed_data
)

static_scenario = okareo.create_scenario_set(scenario_set_create)

Typescript

import { Okareo, ScenarioType, SeedData } from 'okareo-ts-sdk';

// request for scenario set creation 
const static_scenario: any = await okareo.create_scenario_set({
    name: "your_static_scenario_name",
    project_id: project_id,
    generation_type: ScenarioType.SEED,
    seed_data: [
        SeedData({input:"input1", result:"result1"}),
        SeedData({input:"input2", result:"result2"}),
        SeedData({input:"input3", result:"result3"})
    ]
});

Finally, to use a previously created scenario as a seed, pass its scenario_id to okareo.generate_scenarios (Python) or okareo.generate_scenario_set (Typescript).

Python

# use the previously generated `static_scenario` to seed another generated scenario

new_generated_scenario = okareo.generate_scenarios(
    source_scenario=static_scenario.scenario_id,
    name="generated_seed_scenario"
)

Typescript

// use the previously generated `static_scenario` to seed another generated scenario

const new_generated_scenario: any = await okareo.generate_scenario_set({
    project_id: project_id,
    name: "generated_seed_scenario",
    source_scenario_id: static_scenario.scenario_id,
    number_examples: 5,
    generation_type: ScenarioType.NEGATION
});

Okareo synthetic generators

Assuming you have an existing scenario to use as a Seed, Okareo lets you automatically generate synthetic test cases based on a suite of scenario generators. You can use Okareo's predefined generators or write your own custom generators.

To use a scenario generator, you can use the following templates:

Python

from okareo_api_client.models import ScenarioType
# assuming you have an available seed scenario `source_scenario`

okareo.generate_scenarios(
    source_scenario=source_scenario.scenario_id,
    name="generated_scenario",
    num_examples=1,
    generation_type=ScenarioType.REPHRASE_INVARIANT
)

Typescript

import { Okareo, ScenarioType } from 'okareo-ts-sdk';
// assuming you have an available seed scenario `source_scenario`

const scenario: any = await okareo.generate_scenario_set({
    project_id: project_id,
    name: "generated_scenario",
    source_scenario_id: source_scenario.scenario_id,
    number_examples: 5,
    generation_type: ScenarioType.REPHRASE_INVARIANT
});

For each input in the seed scenario, the generator will attempt to generate num_examples variations of that input.
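
Because the generator only attempts to produce that many variations per input, the product of the two numbers is best read as an upper bound on the size of the generated scenario. A small sketch of that arithmetic follows; the helper name is hypothetical and not an SDK function.

```python
def max_generated_rows(num_seed_rows: int, examples_per_row: int) -> int:
    """Upper bound on generated rows: one batch of variations per seed row."""
    return num_seed_rows * examples_per_row


# A seed scenario with 3 rows and 5 requested variations per row
# produces at most 15 generated rows.
upper_bound = max_generated_rows(3, 5)
```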

The generator type is denoted by the ScenarioType enum, and the above example uses the Rephrasing generator. To use a different generator, simply change the enum to a valid ScenarioType in the table below.

Generator	`ScenarioType`	Brief Description
Rephrasing	REPHRASE_INVARIANT	Changes the wording of each sentence per `input`.
Relevant Terms	TERM_RELEVANCE_INVARIANT	Returns relevant/uniquely identifying words from `input`s.
Misspellings	COMMON_MISSPELLINGS	Adds human-like typing errors to `input`s.
Contractions	COMMON_CONTRACTIONS	Removes characters from `input`.
Reverse Questions	TEXT_REVERSE_QUESTION	Creates questions where an `input` contains the relevant answer.
Conditionals	CONDITIONAL	Changes questions in `input`s to emphasize a specific condition.
Synonyms	SYNONYMS	Replaces substrings in the source scenario with user-specified synonyms.
Custom	CUSTOM_GENERATOR	Generates data based on the user's `generation_prompt`.
Custom Multi-Chunk	CUSTOM_MULTI_CHUNK_GENERATOR	Generates data based on the user's `generation_prompt` and groups of `input`s.

Here, we describe our predefined scenario generators and offer examples of potential use cases. You can try these generators for yourself by checking out scenarios.ipynb.

Rephrasing

The Rephrasing generator rewords each sentence of the input while keeping the same content. This can be useful when you want to ensure that your model returns the same results under semantically identical inputs.

Example

--------Seed #0--------
WebBizz is dedicated to providing our customers with a seamless online shopping experience. Our platform is designed with user-friendly interfaces to help you browse and select the best products suitable for your needs...
-----Generated #0------
WebBizz prioritizes a smooth digital shopping journey for our customers. Our platform is tailored with straightforward interfaces for easier product browsing and selection...

Relevant Terms

The Relevant Terms generator returns three terms based on tf-idf, meaning the terms are frequent in the document and relatively less frequent in the larger corpus of the scenario's inputs. This can be useful when you'd like to produce queries based on keywords, a typical pattern among search engine users.

Example

--------Seed #2--------
WebBizz places immense value on its dedicated clientele, recognizing their loyalty through the exclusive 'Premium Club' membership. This special program is designed to enrich the shopping experience, providing a suite of benefits tailored to our valued members. Among the advantages, members enjoy complimentary shipping, granting them a seamless and cost-effective way to receive their purchases. Additionally, the 'Premium Club' offers early access to sales, allowing members to avail themselves of promotional offers before they are opened to the general public.
-----Generated #0------
offers members club
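
To make the tf-idf idea concrete, here is a minimal standard-library sketch that ranks the terms of one document against a small corpus. It illustrates the scoring technique only: the tokenizer, the plain log(N/df) idf formula, the alphabetical tie-breaking, and the toy corpus are all assumptions, and Okareo's generator may compute its terms differently.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep alphabetic runs as terms."""
    return re.findall(r"[a-z]+", text.lower())


def top_terms(doc_index: int, corpus: list[str], k: int = 3) -> list[str]:
    """Return the k terms of corpus[doc_index] with the highest tf-idf score."""
    tokenized = [tokenize(doc) for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    term_counts = Counter(tokenized[doc_index])
    total_terms = sum(term_counts.values())
    scores = {
        term: (count / total_terms) * math.log(n_docs / doc_freq[term])
        for term, count in term_counts.items()
    }
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:k]


# Toy corpus standing in for the inputs of a seed scenario.
corpus = [
    "members club members offers shipping",
    "shipping returns shipping orders",
    "orders support support shipping",
]
terms = top_terms(0, corpus)
```

In this toy corpus `shipping` appears in every document, so its idf is zero and it never ranks among the top terms, while `members` ranks first because it is both frequent in the first document and absent from the others.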

Misspellings

The Misspellings generator lets you create scenarios with human-like errors. This can be useful if your model will be used in a context where inputs are likely to be error-prone. For example, you may be evaluating a model used in a conversational context (e.g., as a customer service chatbot).

Example

--------Seed #0--------
The quick brown fox jumps over the lazy dog
-----Generated #0------
The quick brown fox jumps over the lazt dog
-----Generated #1------
The quick brown fox humps over the lazy dog

Contractions

The Contractions generator attempts to shorten words in a human-like way. Similar to Misspellings, this generator can be beneficial if your model will be seeing conversational inputs.

Example

--------Seed #0--------
The quick brown fox jumps over the lazy dog
-----Generated #0------
The quick brwn fox jumps over the lazy dog

Reverse Questions

The Reverse Question generator poses questions based on the contents of inputs in the seed scenario. This generator is particularly useful when assessing the robustness of a retrieval model.

Suppose you have a database of articles and you would like to generate questions that a user might pose to a chatbot. The Reverse Question generator can help you get coverage on a wide range of questions that potential customers might pose, allowing you to evaluate the chatbot's robustness on corner cases.

Example

--------Seed #0--------
WebBizz is dedicated to providing our customers with a seamless online shopping experience. Our platform is designed with user-friendly interfaces to help you browse and select the best products suitable for your needs. We offer a wide range of products from top brands and new entrants, ensuring diversity and quality in our offerings. Our 24/7 customer support is ready to assist you with any queries, from product details, shipping timelines, to payment methods. We also have a dedicated FAQ section addressing common concerns. Always ensure you are logged in to enjoy personalized product recommendations and faster checkout processes.
-----Generated #0------
What features does WebBizz offer to enhance the customer's online shopping experience?

Conditionals

The Conditional generator assumes that the input values are questions and rewords each question to emphasize a particular clause. This can be used in conjunction with the Reverse Question generator to further expand your test coverage in a retrieval scenario.

Example

--------Seed #4--------
What is the primary benefit of joining the WebBizz Rewards program?
-----Generated #0------
Should you decide to join the WebBizz Rewards program, what would be the primary benefit?

Synonyms

The Synonyms generator takes two scenarios as its input:

The seed scenario to modify
The synonym set scenario that defines the groups of synonyms to replace with one another

seed_scenario = okareo_client.create_scenario_set(
    ScenarioSetCreate( 
        name="seed_scenario",
        seed_data=[
            SeedData(input_="the quick brown fox jumps over the lazy dog", result="N/A"),
            SeedData(input_="the rain in spain falls mainly on the plain", result="N/A"),
        ]
    )
)

synonym_scenario = okareo_client.create_scenario_set(
    ScenarioSetCreate( 
        name="synonyms_scenario",
        seed_data=[
            SeedData(input_=["brown", "hazel"], result="N/A"),
            SeedData(input_=["lazy", "lethargic"], result="N/A"),
            SeedData(input_=["plain", "field"], result="N/A"),
        ]
    )
)

scenario_set_generate = ScenarioSetGenerate(
    source_scenario_id=seed_scenario.scenario_id,
    name="my_synonym_scenario",
    generation_type=ScenarioType.SYNONYMS,
    scenario_set_id=synonym_scenario.scenario_id,
)
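
To illustrate what swapping synonyms within a group means, the sketch below applies the synonym groups from the example to one seed input using only the standard library. It is a conceptual stand-in rather than Okareo's implementation: the `synonym_variants` helper, whole-word matching, and emitting one variant per swap are assumptions.

```python
import re


def synonym_variants(text: str, synonym_groups: list[list[str]]) -> list[str]:
    """Return one variant per applicable swap.

    Each word of a group that occurs in the text is replaced, in turn,
    by every other member of the same group.
    """
    variants = []
    for group in synonym_groups:
        for word in group:
            pattern = re.compile(rf"\b{re.escape(word)}\b")
            if not pattern.search(text):
                continue
            for replacement in group:
                if replacement != word:
                    # A function replacement avoids backslash interpretation.
                    variants.append(pattern.sub(lambda _m: replacement, text))
    return variants


groups = [["brown", "hazel"], ["lazy", "lethargic"], ["plain", "field"]]
variants = synonym_variants("the quick brown fox jumps over the lazy dog", groups)
```

Groups whose words do not occur in the input, such as `plain`/`field` here, simply produce no variant.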

Custom synthetic generators

Custom Generator

The Custom generator allows you to write your own prompts to generate data based on your seed scenario.

seed_scenario = okareo_client.create_scenario_set(
    ScenarioSetCreate(
        name="seed_scenario",
        seed_data=[
            SeedData(input_="Lorem ipsum dolor sit amet", result="N/A"),
            SeedData(input_="consectetur adipiscing elit, sed do", result="N/A"),
            SeedData(input_="eiusmod tempor incididunt ut labore", result="N/A"),
        ]
    )
)

scenario_set_generate = ScenarioSetGenerate(
    source_scenario_id=seed_scenario.scenario_id,
    name="my_custom_scenario",
    generation_type=ScenarioType.CUSTOM_GENERATOR,
    generation_prompt="generate the next 5 words of 'lorem ipsum' based on the following text: {input}",
)
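
The `{input}` placeholder in `generation_prompt` stands for a seed row's input. As a rough local illustration of how one prompt per seed row could be rendered, the sketch below performs the substitution with plain string replacement. The `render_prompts` helper is hypothetical, and the substitution mechanics Okareo uses internally are an assumption.

```python
generation_prompt = (
    "generate the next 5 words of 'lorem ipsum' "
    "based on the following text: {input}"
)

seed_inputs = [
    "Lorem ipsum dolor sit amet",
    "consectetur adipiscing elit, sed do",
    "eiusmod tempor incididunt ut labore",
]


def render_prompts(template: str, inputs: list[str]) -> list[str]:
    """Fill the {input} placeholder once per seed row.

    str.replace is used instead of str.format so that any other braces
    in the prompt are left untouched.
    """
    return [template.replace("{input}", text) for text in inputs]


rendered = render_prompts(generation_prompt, seed_inputs)
```

Previewing the rendered prompts this way can help catch a missing or misspelled placeholder before a generation request is sent.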

Custom Multi-chunk Generator

The Custom Multi-chunk generator is an extension of the Custom Generator. The generator tries to group consecutive rows of your scenario together, then uses your prompt to generate new rows based on the grouped rows.

Note: The result field of your scenario must be a rank-ordered index for the Multi-chunk generator to function properly.

seed_scenario = okareo_client.create_scenario_set(
    ScenarioSetCreate(
        name="seed_scenario_with_index",
        seed_data=[
            SeedData(input_="Lorem ipsum dolor sit amet", result="1"),
            SeedData(input_="consectetur adipiscing elit, sed do", result="2"),
            SeedData(input_="eiusmod tempor incididunt ut labore", result="3"),
        ]
    )
)

scenario_set_generate = ScenarioSetGenerate(
    source_scenario_id=seed_scenario.scenario_id,
    name="my_custom_multi_chunk_scenario",
    generation_type=ScenarioType.CUSTOM_MULTI_CHUNK_GENERATOR,
    generation_prompt="generate the next 5 words of 'lorem ipsum' based on the following chunks of text: {input}", # 'input' here corresponds to the grouped set of scenario inputs
)
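
The grouping step can be pictured as sorting rows by their index and then windowing over neighbors. The sketch below shows one way to do that locally; the sliding window, the chunk size of two, and joining chunk members with newlines are all assumptions made for illustration, since the grouping strategy Okareo actually uses is not specified here.

```python
def group_consecutive(rows: list[dict], chunk_size: int = 2) -> list[list[str]]:
    """Sort rows by their integer index, then slide a window over neighbors.

    Returns an empty list when there are fewer rows than chunk_size.
    """
    ordered = sorted(rows, key=lambda row: int(row["result"]))
    inputs = [row["input"] for row in ordered]
    return [inputs[i:i + chunk_size] for i in range(len(inputs) - chunk_size + 1)]


def render_chunk_prompts(template: str, chunks: list[list[str]]) -> list[str]:
    """Fill {input} with each chunk, one chunk member per line."""
    return [template.replace("{input}", "\n".join(chunk)) for chunk in chunks]


# Rows deliberately out of order to show why the index matters.
rows = [
    {"input": "eiusmod tempor incididunt ut labore", "result": "3"},
    {"input": "Lorem ipsum dolor sit amet", "result": "1"},
    {"input": "consectetur adipiscing elit, sed do", "result": "2"},
]
chunks = group_consecutive(rows)
prompts = render_chunk_prompts(
    "generate the next 5 words based on the following chunks of text: {input}",
    chunks,
)
```

Without a rank-ordered index there would be no way to decide which rows are neighbors, which is why the note above requires it.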

Chaining generators

Composing multiple generators into a chain can help you test different model behaviors. For example, suppose you have trained a retrieval model on user questions. You might want to see if the model performs well based on keyword queries with and without errors. You might set up a chain of generators as follows:

Python

# static definition for retrieval questions as seed data
seed_data=[
    SeedData(input_="What type of products does WebBizz offer?", result=["75eaa363-dfcc-499f-b2af-1407b43cb133"]),
    ...
]

# upload the seed data
scenario_set_create = ScenarioSetCreate(
    seed_data=seed_data,
    name="Chain Step #1: Seed Questions",
    generation_type=ScenarioType.SEED
)

questions_scenario = okareo.create_scenario_set(scenario_set_create)

# first generator uses uploaded scenario as seed
term_relev_scenario = okareo.generate_scenarios(
    source_scenario=questions_scenario.scenario_id,
    name="Chain Step #2: Term Relevance",
    generation_type=ScenarioType.TERM_RELEVANCE_INVARIANT
)

# second generator uses the first generator's output as a seed
misspellings_scenario = okareo.generate_scenarios(
    source_scenario=term_relev_scenario.scenario_id,
    name="Chain Step #3: Misspellings",
    generation_type=ScenarioType.COMMON_MISSPELLINGS
)

# third generator uses the second generator's output as a seed
contractions_scenario = okareo.generate_scenarios(
    source_scenario=misspellings_scenario.scenario_id,
    name="Chain Step #4: Contractions",
    generation_type=ScenarioType.COMMON_CONTRACTIONS
)

Typescript

import { Okareo, ScenarioType } from 'okareo-ts-sdk';
// static definition for retrieval questions as seed data

/* Pass data directly or upload a jsonl file in the following format:
* //filename: rag_intent_prompts.jsonl
*  {"input": "What type of products does WebBizz offer?", "result": ["75eaa363-dfcc-499f-b2af-1407b43cb133"]}
*  {"input": "input_2", "result": ["UUID_2"]}
*  {"input": "input_3", "result": ["UUID_3"]}
*  ...
*/

// upload the jsonl seed data
const upload_scenario: any = await okareo.upload_scenario_set({
    file_path: "./path/to/your/file.jsonl",
    scenario_name: "Chain Step #1: Seed Questions",
    project_id: project_id
});

// first generator uses uploaded scenario as seed
const term_relev_scenario: any = await okareo.generate_scenario_set({
    source_scenario_id: upload_scenario.scenario_id,
    name: "Chain Step #2: Term Relevance",
    generation_type: ScenarioType.TERM_RELEVANCE_INVARIANT
});

// second generator uses the first generator's output as a seed
const misspellings_scenario: any = await okareo.generate_scenario_set({
    source_scenario_id: term_relev_scenario.scenario_id,
    name: "Chain Step #3: Misspellings",
    generation_type: ScenarioType.COMMON_MISSPELLINGS
});

// third generator uses the second generator's output as a seed
const contractions_scenario: any = await okareo.generate_scenario_set({
    source_scenario_id: misspellings_scenario.scenario_id,
    name: "Chain Step #4: Contractions",
    generation_type: ScenarioType.COMMON_CONTRACTIONS
});

Now all the steps of the chain are available to use in evaluating your retrieval model.
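
The chaining pattern itself is ordinary composition: each step consumes the previous step's output, and every intermediate result is kept. The sketch below reproduces that shape with toy stand-in functions so it can be run without an Okareo client. The three step functions are deliberately crude, hypothetical substitutes for the Relevant Terms, Misspellings, and Contractions generators and do not reflect how those generators work.

```python
import string
from typing import Callable

Rows = list[str]


def keywords(rows: Rows) -> Rows:
    """Stand-in for Relevant Terms: keep lowercase words longer than five letters."""
    result = []
    for row in rows:
        words = [w.strip(string.punctuation).lower() for w in row.split()]
        result.append(" ".join(w for w in words if len(w) > 5))
    return result


def typo(rows: Rows) -> Rows:
    """Stand-in for Misspellings: swap the last two letters of the first word."""
    result = []
    for row in rows:
        words = row.split()
        if words and len(words[0]) >= 2:
            first = words[0]
            words[0] = first[:-2] + first[-1] + first[-2]
        result.append(" ".join(words))
    return result


def contract(rows: Rows) -> Rows:
    """Stand-in for Contractions: drop vowels after the first letter of the last word."""
    result = []
    for row in rows:
        words = row.split()
        if words:
            last = words[-1]
            words[-1] = last[0] + "".join(c for c in last[1:] if c not in "aeiou")
        result.append(" ".join(words))
    return result


def run_chain(seed: Rows, steps: list[Callable[[Rows], Rows]]) -> list[Rows]:
    """Feed each step the previous step's output and keep every intermediate result."""
    outputs = [seed]
    for step in steps:
        outputs.append(step(outputs[-1]))
    return outputs


stages = run_chain(
    ["What type of products does WebBizz offer?"],
    [keywords, typo, contract],
)
```

Keeping every stage, rather than only the final one, mirrors the point above: each step of the chain remains available as its own test set.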

Data quality checks

Okareo checks can be used to filter synthetic data automatically. We refer to checks used in this situation as "data quality checks" since they help ensure the generated data meets your quality standards.

To apply data quality checks to your generations, you can use the following snippet:

Python

from okareo_api_client.models.scenario_set_generate import ScenarioSetGenerate, ScenarioType
from okareo_api_client.models import ScenarioSetCreate, SeedData
from okareo.checks import CheckOutputType

seed_scenario = okareo_client.create_scenario_set(
    ScenarioSetCreate(
        name="Webbizz Docs",
        seed_data=[
            SeedData(input_="Webbizz can help you keep track of and share your wish lists.", result="N/A"),
            SeedData(input_="To receive travel points, exclusive deals, and other benefits, join Webbizz rewards!", result="N/A"),
            SeedData(input_="With a Webbizz Premier subscription, you can get free shipping and easy returns on all your orders.", result="N/A"),
        ]
    )
)

REVERSE_QA_PROMPT = """Generate a question that can be answered given the following text: {scenario_input}"""

generate_request = ScenarioSetGenerate(
    name="Webbizz Docs - Reverse QA (Data Quality Checks)",
    source_scenario_id=seed_scenario.scenario_id,
    generation_type=ScenarioType.CUSTOM_GENERATOR,
    generation_prompt=REVERSE_QA_PROMPT,
    # These checks will be applied to the generated questions, and 
    # only the ones that pass all checks will be included in the final scenario set.
    checks=[
        # Use a predefined check and apply a threshold 
        # Rows above the threshold will be kept
        # Rows below the threshold will be omitted
        {
            "name": "reverse_qa_quality", #predefined check
            "threshold": 4.0,
        },
        # Define a data quality check in-place
        # Rows that pass this check will be kept
        # Rows that fail this check will be omitted
        {
            "name": "is_tool_question_relevant", #custom check
            "description": "Check if the question is related to the provided tool definition.",
            "check_config": {
                # FUNCTION_CALL_QC is a prompt template string that is not defined in this snippet
                "prompt_template": FUNCTION_CALL_QC,
                "type": CheckOutputType.PASS_FAIL.value,
            },
        }
        # If you've already defined a check, you can use the following
        # {
        #     "name": "is_tool_question_relevant", # previously defined custom check
        # }
    ]
)

generated_scenario = okareo_client.generate_scenario_set(generate_request)

# rows that passed all provided data quality checks
print(generated_scenario.scenario_data)

# rows that failed one or more of the data quality checks
print(generated_scenario.scenario_data)
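
The filtering rule described in the comments above (keep a row only if it clears the score threshold and passes every pass/fail check) can be sketched locally as follows. The `ScoredRow` type, its field names, the example scores, and the choice to keep rows that score exactly at the threshold are assumptions for illustration; they are not Okareo data structures.

```python
from dataclasses import dataclass


@dataclass
class ScoredRow:
    text: str
    quality_score: float  # stand-in for a scored check such as reverse_qa_quality
    is_relevant: bool     # stand-in for a pass/fail check


def split_by_checks(
    rows: list[ScoredRow], threshold: float
) -> tuple[list[ScoredRow], list[ScoredRow]]:
    """Partition rows into those that pass every check and those that fail any."""
    passed, failed = [], []
    for row in rows:
        # Assumption: a score exactly at the threshold counts as passing.
        if row.quality_score >= threshold and row.is_relevant:
            passed.append(row)
        else:
            failed.append(row)
    return passed, failed


rows = [
    ScoredRow("How do I share a wish list?", 4.5, True),
    ScoredRow("What is Webbizz?", 2.0, True),
    ScoredRow("What is the weather today?", 4.8, False),
]
passed, failed = split_by_checks(rows, threshold=4.0)
```

A row must satisfy all checks to be kept, so a high quality score does not rescue a row that fails the relevance check.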
