Generating Synthetic Scenarios

In addition to driving evaluations, Okareo scenarios can also be used to generate synthetic data. Generated scenarios can be a powerful tool to improve your model evaluation pipeline by allowing you to:

  • Create new test cases automatically
  • Ensure robustness to input perturbations/human error

Seed scenarios

To generate synthetic data in Okareo, you need to begin with a Seed scenario, so called because it serves as the "seed" for Generated scenarios. Any scenario that has been uploaded to or created in Okareo can serve as the Seed for a Generated scenario.

As of now, there are three paths to creating/designating a Seed scenario:

  1. An uploaded file (.jsonl)
  2. A static definition
  3. An existing scenario (Seed or Generated)

Creating seed scenarios

To create a seed scenario with a .jsonl file, you can use the following:

seed_scenario = okareo.upload_scenario_set(
    file_path='./path/to/your/file.jsonl',
    scenario_name="your_scenario_name"
)
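
As a rough sketch of preparing such a file, the snippet below writes a few rows to a .jsonl file, one JSON object per line. The `input` and `result` keys mirror the SeedData fields used elsewhere on this page and are an assumption about the expected file layout; the helper name `write_seed_jsonl` is hypothetical and not part of the Okareo SDK.

```python
import json
import os
import tempfile


def write_seed_jsonl(rows, file_path):
    """Write one JSON object per line (the JSON Lines convention)."""
    with open(file_path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


# Assumed row layout: "input" and "result" keys, mirroring SeedData.
rows = [
    {"input": "input1", "result": "result1"},
    {"input": "input2", "result": "result2"},
]

seed_path = os.path.join(tempfile.mkdtemp(), "seed.jsonl")
write_seed_jsonl(rows, seed_path)

# Read the file back to confirm each line parses as a standalone JSON object.
with open(seed_path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```

The resulting path could then be passed as `file_path` in the upload call above.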

To create a seed scenario via a static definition, you can use the following:

from okareo_api_client.models import ScenarioSetCreate, ScenarioType, SeedData

# list of statically defined seed data
seed_data = [
    SeedData(input_="input1", result="result1"),
    SeedData(input_="input2", result="result2"),
    SeedData(input_="input3", result="result3")
]

# request for scenario set creation
scenario_set_create = ScenarioSetCreate(
    name="your_static_scenario_name",
    generation_type=ScenarioType.SEED,
    seed_data=seed_data
)

static_scenario = okareo.create_scenario_set(scenario_set_create)

Finally, to use a previously created scenario as a seed, you can call okareo.generate_scenarios with that scenario's scenario_id:

# use the previously generated `static_scenario` to seed another generated scenario

new_generated_scenario = okareo.generate_scenarios(
    source_scenario=static_scenario.scenario_id,
    name="generated_seed_scenario"
)

Okareo synthetic generators

Assuming you have an existing scenario to use as a Seed, Okareo lets you automatically generate synthetic test cases based on a suite of scenario generators. You can use Okareo's predefined generators or write your own custom generators.

To use a scenario generator, you can use the following template:

from okareo_api_client.models import ScenarioType

# assuming you have an available seed scenario `source_scenario`
okareo.generate_scenarios(
    source_scenario=source_scenario.scenario_id,
    name="generated_scenario",
    num_examples=1,
    generation_type=ScenarioType.REPHRASE_INVARIANT
)

For each input in the seed scenario, the generator will attempt to generate num_examples variations of that input.
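
For planning purposes, that relationship can be sketched as simple arithmetic. The function name below is hypothetical, and because the generator only attempts to produce `num_examples` variations per input, the figure is best read as an upper bound rather than a guarantee.

```python
def max_generated_rows(num_seed_rows: int, num_examples: int) -> int:
    """Upper bound on generated rows: num_examples attempts per seed input."""
    if num_seed_rows < 0 or num_examples < 0:
        raise ValueError("counts must be non-negative")
    return num_seed_rows * num_examples


# A seed scenario with 3 inputs and num_examples=2 yields at most 6 rows.
upper_bound = max_generated_rows(num_seed_rows=3, num_examples=2)
```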

The generator type is denoted by the ScenarioType enum, and the above example uses the Rephrasing generator. To use a different generator, simply change the enum to a valid ScenarioType in the table below.

| Generator | ScenarioType | Brief Description |
| --- | --- | --- |
| Rephrasing | REPHRASE_INVARIANT | Changes the wording of each sentence per input. |
| Relevant Terms | TERM_RELEVANCE_INVARIANT | Returns relevant/uniquely identifying words from inputs. |
| Misspellings | COMMON_MISSPELLINGS | Adds human-like typing errors to inputs. |
| Contractions | COMMON_CONTRACTIONS | Removes characters from input. |
| Reverse Questions | TEXT_REVERSE_QUESTION | Creates questions where an input contains the relevant answer. |
| Conditionals | CONDITIONAL | Changes questions in inputs to emphasize a specific condition. |
| Synonyms | SYNONYMS | Replace substrings in source scenario with user-specified synonyms. |
| Custom | CUSTOM_GENERATOR | Generate data based on the user's generation_prompt. |
| Custom Multi-Chunk | CUSTOM_MULTI_CHUNK_GENERATOR | Generate data based on the user's generation_prompt and groups of inputs. |

Here, we describe our predefined scenario generators and offer examples of potential use cases. You can try these generators for yourself by checking out scenarios.ipynb.

Rephrasing

The Rephrasing generator rewords each sentence of the input while keeping the same content. This can be useful when you want to ensure that your model returns the same results under semantically identical inputs.

Example

--------Seed #0--------
WebBizz is dedicated to providing our customers with a seamless online shopping experience. Our platform is designed with user-friendly interfaces to help you browse and select the best products suitable for your needs...
-----Generated #0------
WebBizz prioritizes a smooth digital shopping journey for our customers. Our platform is tailored with straightforward interfaces for easier product browsing and selection...

Relevant Terms

The Relevant Terms generator returns three terms based on tf-idf, meaning the terms are frequent in the document and relatively less frequent in the larger corpus of the scenario's inputs. This can be useful when you'd like to produce queries based on keywords, a typical pattern among search engine users.

Example

--------Seed #2--------
WebBizz places immense value on its dedicated clientele, recognizing their loyalty through the exclusive 'Premium Club' membership. This special program is designed to enrich the shopping experience, providing a suite of benefits tailored to our valued members. Among the advantages, members enjoy complimentary shipping, granting them a seamless and cost-effective way to receive their purchases. Additionally, the 'Premium Club' offers early access to sales, allowing members to avail themselves of promotional offers before they are opened to the general public.
-----Generated #0------
offers members club
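
To make the tf-idf idea concrete, here is a minimal pure-Python sketch. It is an illustration of the general technique, not Okareo's implementation: the tokenizer, the unsmoothed `log(N / df)` weighting, the alphabetical tie-break, and the function names are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_terms(docs, doc_index, k=3):
    """Return the k highest-scoring tf-idf terms for docs[doc_index]."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    tf = Counter(tokenized[doc_index])
    total = len(tokenized[doc_index])
    scores = {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    # Sort by score (descending), breaking ties alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


docs = [
    "members club offers members early access",
    "shipping is free for all orders",
    "returns are free for all orders",
]
terms = top_terms(docs, 0)
```

Terms that appear in every input score zero under this weighting, which is why corpus-wide filler words tend not to surface.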

Misspellings

The Misspellings generator lets you create scenarios with human-like errors. This can be useful if your model will be used in a context where inputs are likely to be error-prone. For example, you may be evaluating a model used in a conversational context (e.g., as a customer service chatbot).

Example

--------Seed #0--------
The quick brown fox jumps over the lazy dog
-----Generated #0------
The quick brown fox jumps over the lazt dog
-----Generated #1------
The quick brown fox humps over the lazy dog
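
One way to picture this kind of perturbation is a keyboard-neighbor substitution. The sketch below is only an illustration under that assumption; the partial `NEIGHBORS` map and the `add_typo` helper are hypothetical, and the actual generator may work differently.

```python
import random

# Partial QWERTY neighbor map (illustrative, not exhaustive).
NEIGHBORS = {
    "a": "qsz", "d": "sfe", "e": "wrd", "h": "gj", "i": "uo",
    "j": "hk", "k": "jl", "n": "bm", "o": "ip", "r": "et",
    "s": "ad", "t": "ry", "u": "yi", "y": "tu",
}


def add_typo(text, rng):
    """Replace one randomly chosen character with a keyboard neighbor."""
    candidates = [i for i, ch in enumerate(text) if ch in NEIGHBORS]
    if not candidates:
        return text  # nothing in the text that we know how to perturb
    i = rng.choice(candidates)
    return text[:i] + rng.choice(NEIGHBORS[text[i]]) + text[i + 1:]


rng = random.Random(0)  # fixed seed so repeated runs give the same typo
seed = "The quick brown fox jumps over the lazy dog"
typo = add_typo(seed, rng)
```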

Contractions

The Contractions generator attempts to shorten words in a human-like way. Similar to Misspellings, this generator can be beneficial if your model will be seeing conversational inputs.

Example

--------Seed #0--------
The quick brown fox jumps over the lazy dog
-----Generated #0------
The quick brwn fox jumps over the lazy dog
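
A simple way to approximate this behavior is to drop an interior vowel from one word. The following is a sketch under that assumption, with hypothetical helper names; it is not a description of how the Okareo generator is implemented.

```python
import random

VOWELS = set("aeiou")


def drop_interior_vowel(word):
    """Remove the first vowel that is neither the first nor the last letter."""
    for i in range(1, len(word) - 1):
        if word[i].lower() in VOWELS:
            return word[:i] + word[i + 1:]
    return word


def contract(text, rng):
    """Shorten one randomly chosen word that has an interior vowel."""
    words = text.split(" ")
    candidates = [i for i, w in enumerate(words) if drop_interior_vowel(w) != w]
    if not candidates:
        return text
    i = rng.choice(candidates)
    words[i] = drop_interior_vowel(words[i])
    return " ".join(words)


seed = "The quick brown fox jumps over the lazy dog"
shortened = contract(seed, random.Random(0))
```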

Reverse Questions

The Reverse Question generator poses questions based on the contents of inputs in the seed scenario. This generator is particularly useful when assessing the robustness of a retrieval model.

Suppose you have a database of articles and you would like to generate questions that a user might pose to a chatbot. The Reverse Question generator can help you get coverage on a wide range of questions that potential customers might pose, allowing you to evaluate the chatbot's robustness on corner cases.

Example

--------Seed #0--------
WebBizz is dedicated to providing our customers with a seamless online shopping experience. Our platform is designed with user-friendly interfaces to help you browse and select the best products suitable for your needs. We offer a wide range of products from top brands and new entrants, ensuring diversity and quality in our offerings. Our 24/7 customer support is ready to assist you with any queries, from product details, shipping timelines, to payment methods. We also have a dedicated FAQ section addressing common concerns. Always ensure you are logged in to enjoy personalized product recommendations and faster checkout processes.
-----Generated #0------
What features does WebBizz offer to enhance the customer's online shopping experience?

Conditionals

The Conditional generator assumes that the input values are questions and rewords each question to emphasize a particular clause. This can be used in conjunction with the Reverse Question generator to further expand your test coverage in a retrieval scenario.

Example

--------Seed #4--------
What is the primary benefit of joining the WebBizz Rewards program?
-----Generated #0------
Should you decide to join the WebBizz Rewards program, what would be the primary benefit?

Synonyms

The Synonyms generator takes two scenarios as its input:

  1. The seed scenario to modify
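  2. The synonym set scenario that defines the groups of synonyms to replace with one another

For example:

from okareo_api_client.models.scenario_set_generate import ScenarioSetGenerate, ScenarioType
from okareo_api_client.models import ScenarioSetCreate, SeedData

# seed scenario whose inputs will be modified
seed_scenario = okareo_client.create_scenario_set(
    ScenarioSetCreate(
        name="seed_scenario",
        seed_data=[
            SeedData(input_="the quick brown fox jumps over the lazy dog", result="N/A"),
            SeedData(input_="the rain in spain falls mainly on the plain", result="N/A"),
        ]
    )
)

# synonym set scenario: each input is one group of interchangeable words
synonym_scenario = okareo_client.create_scenario_set(
    ScenarioSetCreate(
        name="synonyms_scenario",
        seed_data=[
            SeedData(input_=["brown", "hazel"], result="N/A"),
            SeedData(input_=["lazy", "lethargic"], result="N/A"),
            SeedData(input_=["plain", "field"], result="N/A"),
        ]
    )
)

scenario_set_generate = ScenarioSetGenerate(
    source_scenario_id=seed_scenario.scenario_id,
    name="my_synonym_scenario",
    generation_type=ScenarioType.SYNONYMS,
    scenario_set_id=synonym_scenario.scenario_id,
)

# submit the generation request
synonym_generated = okareo_client.generate_scenario_set(scenario_set_generate)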
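
To illustrate the kind of output to expect, here is a local sketch of whole-word synonym replacement. It assumes each word is swapped for the next member of its group, which is one plausible reading of "replace with one another"; the `swap_synonyms` helper is hypothetical and runs entirely offline.

```python
import re


def swap_synonyms(text, synonym_groups):
    """Replace each whole-word match with the next synonym in its group."""
    replacements = {}
    for group in synonym_groups:
        for i, word in enumerate(group):
            replacements[word] = group[(i + 1) % len(group)]
    if not replacements:
        return text
    # A single regex pass avoids re-replacing words that were just swapped in.
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, replacements)) + r")\b"
    )
    return pattern.sub(lambda m: replacements[m.group(1)], text)


groups = [["brown", "hazel"], ["lazy", "lethargic"], ["plain", "field"]]
swapped = swap_synonyms("the quick brown fox jumps over the lazy dog", groups)
```

With the groups above, "brown" becomes "hazel" and "lazy" becomes "lethargic", while "spain" is left alone because only whole words are matched.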

Custom synthetic generators

Custom Generator

The Custom generator allows you to write your own prompts to generate data based on your seed scenario.

from okareo_api_client.models.scenario_set_generate import ScenarioSetGenerate, ScenarioType
from okareo_api_client.models import ScenarioSetCreate, SeedData

seed_scenario = okareo_client.create_scenario_set(
    ScenarioSetCreate(
        name="seed_scenario",
        seed_data=[
            SeedData(input_="Lorem ipsum dolor sit amet", result="N/A"),
            SeedData(input_="consectetur adipiscing elit, sed do", result="N/A"),
            SeedData(input_="eiusmod tempor incididunt ut labore", result="N/A"),
        ]
    )
)

scenario_set_generate = ScenarioSetGenerate(
    source_scenario_id=seed_scenario.scenario_id,
    name="my_custom_scenario",
    generation_type=ScenarioType.CUSTOM_GENERATOR,
    generation_prompt="generate the next 5 words of 'lorem ipsum' based on the following text: {input}",
)

# submit the generation request
custom_generated = okareo_client.generate_scenario_set(scenario_set_generate)
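
Conceptually, the `{input}` placeholder in the prompt stands in for each row's input. The sketch below shows that substitution locally; it is an assumption about how the template is filled, and `render_prompts` is a hypothetical helper rather than an SDK function.

```python
PROMPT = (
    "generate the next 5 words of 'lorem ipsum' "
    "based on the following text: {input}"
)


def render_prompts(template, inputs):
    """Fill the {input} placeholder once per seed row.

    str.replace is used instead of str.format so that stray braces
    inside a seed input cannot raise a formatting error.
    """
    return [template.replace("{input}", text) for text in inputs]


seed_inputs = [
    "Lorem ipsum dolor sit amet",
    "consectetur adipiscing elit, sed do",
]
rendered = render_prompts(PROMPT, seed_inputs)
```

Previewing the rendered prompts this way can help catch a misspelled placeholder before spending a generation run on it.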

Custom Multi-chunk Generator

The Custom Multi-chunk generator is an extension of the Custom Generator. The generator tries to group consecutive rows of your scenario together, then uses your prompt to generate new rows based on the grouped rows.

Note: The result field of your scenario must be a rank-ordered index for the Multi-chunk generator to function properly.

from okareo_api_client.models.scenario_set_generate import ScenarioSetGenerate, ScenarioType
from okareo_api_client.models import ScenarioSetCreate, SeedData

seed_scenario = okareo_client.create_scenario_set(
    ScenarioSetCreate(
        name="seed_scenario_with_index",
        seed_data=[
            SeedData(input_="Lorem ipsum dolor sit amet", result="1"),
            SeedData(input_="consectetur adipiscing elit, sed do", result="2"),
            SeedData(input_="eiusmod tempor incididunt ut labore", result="3"),
        ]
    )
)

scenario_set_generate = ScenarioSetGenerate(
    source_scenario_id=seed_scenario.scenario_id,
    name="my_custom_multi_chunk_scenario",
    generation_type=ScenarioType.CUSTOM_MULTI_CHUNK_GENERATOR,
    # 'input' here corresponds to the grouped set of scenario inputs
    generation_prompt="generate the next 5 words of 'lorem ipsum' based on the following chunks of text: {input}",
)

# submit the generation request
multi_chunk_generated = okareo_client.generate_scenario_set(scenario_set_generate)
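
The grouping step can be pictured with a small local sketch. It assumes rows are ordered by the integer value of `result` and split into non-overlapping chunks of a fixed size; the real generator's chunk size and overlap behavior are not specified here, and `group_consecutive` is a hypothetical name.

```python
def group_consecutive(rows, chunk_size=2):
    """Order rows by their integer rank and group neighbors into chunks."""
    ordered = sorted(rows, key=lambda r: int(r["result"]))
    return [
        [r["input"] for r in ordered[i:i + chunk_size]]
        for i in range(0, len(ordered), chunk_size)
    ]


# Rows are deliberately out of order to show why the rank index matters.
rows = [
    {"input": "eiusmod tempor incididunt ut labore", "result": "3"},
    {"input": "Lorem ipsum dolor sit amet", "result": "1"},
    {"input": "consectetur adipiscing elit, sed do", "result": "2"},
]
chunks = group_consecutive(rows, chunk_size=2)
```

Without a usable rank in `result`, there would be no reliable way to tell which rows are neighbors, which is the motivation for the note above.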

Chaining generators

Composing multiple generators into a chain can help you test different model behaviors. For example, suppose you have trained a retrieval model on user questions. You might want to see if the model performs well based on keyword queries with and without errors. You might set up a chain of generators as follows:

# static definition for retrieval questions as seed data
seed_data = [
    SeedData(input_="What type of products does WebBizz offer?", result=["75eaa363-dfcc-499f-b2af-1407b43cb133"]),
    # ...
]

# upload the seed data
scenario_set_create = ScenarioSetCreate(
    seed_data=seed_data,
    name="Chain Step #1: Seed Questions",
    generation_type=ScenarioType.SEED
)

questions_scenario = okareo.create_scenario_set(scenario_set_create)

# first generator uses uploaded scenario as seed
term_relev_scenario = okareo.generate_scenarios(
    source_scenario=questions_scenario.scenario_id,
    name="Chain Step #2: Term Relevance",
    generation_type=ScenarioType.TERM_RELEVANCE_INVARIANT
)

# second generator uses the first generator's output as a seed
misspellings_scenario = okareo.generate_scenarios(
    source_scenario=term_relev_scenario.scenario_id,
    name="Chain Step #3: Misspellings",
    generation_type=ScenarioType.COMMON_MISSPELLINGS
)

# third generator uses the second generator's output as a seed
contractions_scenario = okareo.generate_scenarios(
    source_scenario=misspellings_scenario.scenario_id,
    name="Chain Step #4: Contractions",
    generation_type=ScenarioType.COMMON_CONTRACTIONS
)

Now all the steps of the chain are available to use in evaluating your retrieval model.
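
The chaining pattern itself is plain function composition in which every intermediate result is kept. The sketch below shows that shape with stand-in generators that run locally; `run_chain`, `keep_long_words`, and `drop_last_letter` are hypothetical and only mimic the idea of keyword extraction followed by a perturbation.

```python
def run_chain(seed_rows, generators):
    """Feed each generator the previous step's output and keep every step."""
    steps = [list(seed_rows)]
    for generate in generators:
        steps.append(generate(steps[-1]))
    return steps


# Stand-in generators: real ones would call the Okareo API instead.
def keep_long_words(rows):
    """Crude keyword step: keep only words longer than four characters."""
    return [" ".join(w for w in row.split() if len(w) > 4) for row in rows]


def drop_last_letter(rows):
    """Crude perturbation step: trim the final character of every word."""
    return [" ".join(w[:-1] for w in row.split()) for row in rows]


steps = run_chain(
    ["What type of products does WebBizz offer?"],
    [keep_long_words, drop_last_letter],
)
```

Keeping `steps[0]`, `steps[1]`, and `steps[2]` separately mirrors the chain above, where each intermediate scenario remains available for evaluation.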

Data quality checks

Okareo checks can be used to filter synthetic data automatically. We refer to checks used in this situation as "data quality checks" since they help ensure the generated data meets your quality standards.

To apply data quality checks to your generations, you can use the following snippet:

from okareo_api_client.models.scenario_set_generate import ScenarioSetGenerate, ScenarioType
from okareo_api_client.models import ScenarioSetCreate, SeedData
from okareo.checks import CheckOutputType

seed_scenario = okareo_client.create_scenario_set(
    ScenarioSetCreate(
        name="Webbizz Docs",
        seed_data=[
            SeedData(input_="Webbizz can help you keep track of and share your wish lists.", result="N/A"),
            SeedData(input_="To receive travel points, exclusive deals, and other benefits, join Webbizz rewards!", result="N/A"),
            SeedData(input_="With a Webbizz Premier subscription, you can get free shipping and easy returns on all your orders.", result="N/A"),
        ]
    )
)

REVERSE_QA_PROMPT = """Generate a question that can be answered given the following text: {scenario_input}"""

generate_request = ScenarioSetGenerate(
    name="Webbizz Docs - Reverse QA (Data Quality Checks)",
    source_scenario_id=seed_scenario.scenario_id,
    generation_type=ScenarioType.CUSTOM_GENERATOR,
    generation_prompt=REVERSE_QA_PROMPT,
    # These checks will be applied to the generated questions, and
    # only the ones that pass all checks will be included in the final scenario set.
    checks=[
        # Use a predefined check and apply a threshold
        # Rows above the threshold will be kept
        # Rows below the threshold will be omitted
        {
            "name": "reverse_qa_quality",  # predefined check
            "threshold": 4.0,
        },
        # Define a data quality check in-place
        # Rows that pass this check will be kept
        # Rows that fail this check will be omitted
        {
            "name": "is_tool_question_relevant",  # custom check
            "description": "Check if the question is related to the provided tool definition.",
            "check_config": {
                # FUNCTION_CALL_QC is assumed to be a prompt template string
                # that you have defined earlier in your own code
                "prompt_template": FUNCTION_CALL_QC,
                "type": CheckOutputType.PASS_FAIL.value,
            },
        }
        # If you've already defined a check, you can use the following
        # {
        #     "name": "is_tool_question_relevant",  # previously defined custom check
        # }
    ]
)

generated_scenario = okareo_client.generate_scenario_set(generate_request)

# rows that passed all provided data quality checks
print(generated_scenario.scenario_data)

# rows that failed one or more of the data quality checks
print(generated_scenario.scenario_data)
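
The keep-or-omit logic described in the comments above can be sketched locally as a partition over rows. Everything in this example is illustrative: the scores are made-up data, the `>=` comparison at the threshold is an assumption, and the helper names are hypothetical rather than part of the Okareo SDK.

```python
def partition_rows(rows, checks):
    """Split rows into (passed, failed); a row passes only if every check does."""
    passed, failed = [], []
    for row in rows:
        target = passed if all(check(row) for check in checks) else failed
        target.append(row)
    return passed, failed


def score_at_least(threshold):
    """Build a threshold check (assumes rows at the threshold are kept)."""
    return lambda row: row["score"] >= threshold


def is_question(row):
    """Pass/fail check: the generated text should end with a question mark."""
    return row["text"].strip().endswith("?")


# Made-up rows and scores, purely for illustration.
rows = [
    {"text": "What benefits does Webbizz rewards offer?", "score": 4.5},
    {"text": "Webbizz has wish lists.", "score": 4.8},
    {"text": "Is shipping free?", "score": 2.0},
]
passed, failed = partition_rows(rows, [score_at_least(4.0), is_question])
```

Here only the first row survives: the second fails the pass/fail check and the third falls below the threshold.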