
Voice Augmentation

An agent that passes every test under clean audio conditions may fail the same tests under realistic ones. Voice benchmarking research consistently shows a pass-rate drop between clean and noisy environments, and the drop isn't just about speech recognition accuracy. Background noise, interruptions, and cross-talk test whether the agent's conversation logic holds up: turn detection, recovery from barge-in, tool-calling sequences under garbled input, and handling off-mic speech.

Augmentation is how you create those conditions. It injects realistic audio effects (noise, interruptions, side conversations) into your voice runs so you can run the same test suite under controlled adversity.

What Augmentation Does

Each augmentation strategy modifies one aspect of the audio or conversation flow during the simulation:

| Strategy | What it does | When to use |
| --- | --- | --- |
| noise | Mixes background noise (cafeteria, traffic, etc.) into the caller's audio at a configured SNR | Validate ASR robustness; baseline of every realistic run |
| barge_in | The caller interrupts the agent mid-reply with a short prompt | Test the agent's interruption handling and recovery |
| backchannel | Caller utters a short acknowledgement ("mm-hmm", "right") while the agent speaks | Test the agent's turn-detection logic |
| directed_speech | Caller turns away mid-call to address someone in their environment | Test the agent against off-mic cross-talk |
| secondary_speaker | A second voice in the room interjects with on-topic commentary | Test multi-speaker robustness and confusion handling |
| cap | Driver emits multiple consecutive messages per turn (Concurrent Ask Probability) | Stress-test agent behavior when the caller sends a burst of messages |

Composition Rule

Augmentation is configured with the augmentation parameter of okareo.run_simulation(...). Composition is constrained:

You may use at most one strategy, or noise combined with one other strategy.

Valid combos: noise alone, barge_in alone, noise + barge_in, noise + backchannel, etc. Invalid: barge_in + backchannel, or any three strategies together.
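The rule can be checked on the client before a run is submitted. The following is a minimal sketch, not part of the Okareo SDK: `validate_composition` is a hypothetical helper, and the strategy keys are taken from the table above.

```python
# Hypothetical client-side check of the composition rule; not part of the Okareo SDK.
STRATEGIES = {"noise", "barge_in", "backchannel", "directed_speech", "secondary_speaker", "cap"}


def validate_composition(selected):
    """Return True if the selected strategy keys satisfy the composition rule."""
    chosen = set(selected)
    unknown = chosen - STRATEGIES
    if unknown:
        raise ValueError(f"unknown strategies: {sorted(unknown)}")
    # noise is the only strategy that may be combined with another one
    others = chosen - {"noise"}
    return len(others) <= 1


print(validate_composition(["noise", "barge_in"]))        # allowed: noise plus one other
print(validate_composition(["barge_in", "backchannel"]))  # rejected: two non-noise strategies
```

Treating noise as the only combinable key keeps the check to a single set difference: whatever remains after removing noise must contain at most one strategy.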

Configuring Augmentation

In the App

In the simulation form, once a voice target is selected, open Advanced Settings and scroll to Voice Simulation Settings:

  1. Background Noise: Select a profile from the dropdown (Cafeteria, Classroom, Office Babble, Traffic, or None), then set the Signal to Noise Ratio (-5 to 25 dB). Lower SNR means more noise relative to the speaker.

  2. Voice Augmentation: Pick one strategy from the chips: Barge-In, Backchannel, Directed Speech, Secondary Speaker, or Concurrent Ask. Each strategy shows its own parameter fields when selected.

  3. Background noise and the augmentation strategy are independent controls. Combine noise with one strategy (e.g. noise + barge-in); you cannot select two strategies at once.

Voice Simulation Settings with a Background Noise profile, SNR field, and the Barge-In augmentation chip selected

UI vs SDK defaults

Some parameter defaults differ between the UI form and the server/SDK. Where they differ, the strategy parameter tables below note the server defaults. If you need the same behavior across both paths, set values explicitly.

From the SDK

Pass an Augmentation container to okareo.run_simulation(...). Each strategy has a typed config class in okareo.augmentations; set one strategy field, or noise plus one other.

from okareo.augmentations import Augmentation, BargeInAugmentation, NoiseAugmentation

# Assumes okareo, target, scenario, and driver are already defined.
result = okareo.run_simulation(
    name="Noise + Barge-In",
    target=target,
    scenario=scenario,
    driver=driver,
    max_turns=5,
    first_turn="driver",
    checks=["avg_turn_taking_latency", "result_completed"],
    # noise plus one other strategy is the largest allowed combination
    augmentation=Augmentation(
        noise=NoiseAugmentation(profile="cafeteria", snr_db=10),
        barge_in=BargeInAugmentation(
            probability=0.5,
            min_offset_ms=200,
            max_offset_ms=600,
            prompt="Ask for a very short polite interruption.",
        ),
    ),
)
print(f"Results: {result.app_link}")

The strategy classes are NoiseAugmentation, BargeInAugmentation, BackchannelAugmentation, DirectedSpeechAugmentation, SecondarySpeakerAugmentation, and CAPAugmentation, matching the strategy keys in the table above.

Strategy Parameters

Parameter names, types, and defaults below are from the server strategy constructors (the source of truth for what the API accepts).

noise

| Parameter | Type | Valid values | Default |
| --- | --- | --- | --- |
| noise_profile | string | cafeteria, classroom, office_babble, traffic | required (no default) |
| noise_snr_db | number | -5 to 25 | 10 |
| seed | int or null | Any integer for reproducibility | null |

Lower SNR = more noise relative to the speaker. 10 is moderate; try 0 to 5 for stress conditions.
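For intuition about the scale, SNR in decibels is 10 * log10(P_signal / P_noise). The sketch below uses that standard definition to show how much noise power each setting implies. It is plain arithmetic, not Okareo's mixing code, and `noise_to_signal_power_ratio` is a hypothetical helper name.

```python
def noise_to_signal_power_ratio(snr_db):
    """Noise power as a fraction of speech power, from SNR_dB = 10 * log10(P_signal / P_noise)."""
    return 10 ** (-snr_db / 10)


for snr_db in (25, 10, 5, 0, -5):
    ratio = noise_to_signal_power_ratio(snr_db)
    print(f"SNR {snr_db:>3} dB: noise power is {ratio:.3f}x the speech power")
```

At 10 dB the noise carries one tenth of the speech power, at 0 dB the two are equal, and at -5 dB the noise is roughly three times stronger than the speech.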

tip

The UI form defaults noise_profile to cafeteria, but the API requires you to specify it explicitly. Omitting it returns a 400 error.

Aliases accepted: profile → noise_profile, snr_db → noise_snr_db. The SDK's NoiseAugmentation(profile=..., snr_db=...) uses the alias names.
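If you build noise settings by hand, a small pre-flight check can catch a missing profile or an out-of-range SNR before the request is sent. This is a sketch under assumptions: `normalize_noise_config` is a hypothetical helper, it only mirrors the table and alias mapping above, and the server remains the source of truth.

```python
# Hypothetical pre-flight check; the server remains the source of truth.
ALIASES = {"profile": "noise_profile", "snr_db": "noise_snr_db"}
PROFILES = {"cafeteria", "classroom", "office_babble", "traffic"}


def normalize_noise_config(config):
    """Map alias keys to canonical names, fill documented defaults, and validate."""
    out = {ALIASES.get(key, key): value for key, value in config.items()}
    if out.get("noise_profile") not in PROFILES:
        raise ValueError(
            "noise_profile is required and must be one of " + ", ".join(sorted(PROFILES))
        )
    out.setdefault("noise_snr_db", 10)  # documented server default
    out.setdefault("seed", None)
    if not -5 <= out["noise_snr_db"] <= 25:
        raise ValueError("noise_snr_db must be between -5 and 25")
    return out


print(normalize_noise_config({"profile": "traffic", "snr_db": 5}))
```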

barge_in

| Parameter | Type | Range / Notes | Default |
| --- | --- | --- | --- |
| prompt | string | LLM instruction for the interruption phrasing | required |
| probability | number | 0 to 1 (per agent turn) | 0.2 |
| min_offset_ms | int | 0 or above | 200 |
| max_offset_ms | int | at least min_offset_ms | 600 |
| seed | int or null | Reproducibility | null |

Probability is evaluated per agent reply. min_offset_ms and max_offset_ms bound the delay between the agent starting to speak and the interruption firing.
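One way to picture how these parameters interact is a seeded random gate followed by a draw for the offset. This sketch is an assumption about the behavior described above, not the server's implementation: `plan_barge_ins` is a hypothetical helper, and the uniform draw between the two offsets is illustrative.

```python
import random


def plan_barge_ins(agent_turns, probability=0.2, min_offset_ms=200, max_offset_ms=600, seed=None):
    """For each agent turn, return the interruption offset in ms, or None if no barge-in fires."""
    if not 0 <= probability <= 1:
        raise ValueError("probability must be between 0 and 1")
    if min_offset_ms < 0 or max_offset_ms < min_offset_ms:
        raise ValueError("need 0 <= min_offset_ms <= max_offset_ms")
    rng = random.Random(seed)  # a fixed seed makes the plan reproducible
    plan = []
    for _ in range(agent_turns):
        if rng.random() < probability:
            # Gate fired: pick a delay inside the configured window (inclusive)
            plan.append(rng.randint(min_offset_ms, max_offset_ms))
        else:
            plan.append(None)
    return plan


print(plan_barge_ins(agent_turns=5, probability=0.5, seed=42))
```

Running the function twice with the same seed yields the same plan, which is the property the seed parameter is meant to provide.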

backchannel

| Parameter | Type | Range / Notes | Default |
| --- | --- | --- | --- |
| utterance | string | Non-empty after trim | mm-hmm |
| probability | number | 0 to 1 | 0.35 |
| min_offset_ms | int | 0 or above | 150 |
| max_offset_ms | int | at least min_offset_ms | 450 |
| seed | int or null | Reproducibility | null |

Backchannels repeat while the target is speaking. They do not consume a turn.

tip

The UI form uses different offset defaults (1000/10000). The values above are the server defaults. When calling via SDK, specify offsets explicitly if you want the wider window.

directed_speech

| Parameter | Type | Range / Notes | Default |
| --- | --- | --- | --- |
| prompt | string or null | LLM instruction for the off-mic remark | (built-in template) |
| probability | number | 0 to 1 | 0.3 |
| lpf_cutoff_hz | number | 1 to 20000 | 800 |
| gain_db | number | -40 to 0 | -8 |

Low-pass filter and gain attenuation simulate the muffled, off-mic quality of someone turning their head away from the phone.
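To make the two audio parameters concrete, the sketch below applies a simple one-pole low-pass filter and a gain in decibels to a list of samples. It is an illustration under stated assumptions: the server's actual filter design is not documented here, and `db_to_amplitude` and `muffle` are hypothetical names.

```python
import math


def db_to_amplitude(gain_db):
    """Convert a gain in dB to a linear amplitude factor (20 * log10 convention)."""
    return 10 ** (gain_db / 20)


def muffle(samples, sample_rate_hz, lpf_cutoff_hz=800, gain_db=-8):
    """Apply a one-pole low-pass filter, then attenuate by gain_db."""
    # Smoothing coefficient of a first-order RC low-pass at the given cutoff
    dt = 1.0 / sample_rate_hz
    rc = 1.0 / (2 * math.pi * lpf_cutoff_hz)
    alpha = dt / (rc + dt)
    gain = db_to_amplitude(gain_db)
    out, prev = [], 0.0
    for x in samples:
        prev = prev + alpha * (x - prev)  # move part of the way toward the input
        out.append(prev * gain)
    return out


print(f"-8 dB scales amplitude by {db_to_amplitude(-8):.3f}")
```

With the defaults, slow-changing content passes through at roughly 40% of its original amplitude, while rapid sample-to-sample changes are suppressed much more strongly, which is what produces the muffled quality.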

secondary_speaker

| Parameter | Type | Range / Notes | Default |
| --- | --- | --- | --- |
| secondary_prompt | string or null | LLM instruction for the second speaker's interjection | (built-in template) |
| secondary_voice | string | Named voice from Okareo's voice catalog | Cathy - Coworker |
| probability | number | 0 to 1 | 0.3 |
| inter_speaker_pause_ms | int | 0 to 5000 | 120 |
| lpf_cutoff_hz | number or null | 1 to 20000 (null disables) | 800 |
| gain_db | number | -40 to 0 | -8 |

Aliases accepted: prompt → secondary_prompt, voice → secondary_voice. The SDK's SecondarySpeakerAugmentation(voice=..., prompt=...) uses the alias names.

tip

The UI form defaults to Carson - Curious Conversationalist. The server default is Cathy - Coworker. Specify secondary_voice explicitly when calling via SDK.

cap

| Parameter | Type | Range / Notes | Default |
| --- | --- | --- | --- |
| probability | number | 0 to 1 | 0.3 |
| pause_ms | int or null | 0 to 10000, maps to turn_transition_time on the target | 1000 |

When the probability gate fires, the driver emits multiple consecutive messages in a single turn. The number of messages comes from the driver's get_default_consecutive_messages() setting (typically 2). pause_ms controls the gap between the driver finishing and the target responding.

Reading Augmented Results

Compare the augmented run's mean_scores against a clean baseline. The expected pattern:

  • avg_turn_taking_latency rises as the agent works harder against noise or interruptions.
  • result_completed may drop if the agent can't handle the augmented conditions.
  • response_loop flips to 0 if the agent gets stuck after an interruption.

Augmented run results showing the impact of noise and barge-in on check scores
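
A quick way to read the comparison is to compute per-check deltas between the two runs. The sketch below assumes each run's mean_scores is available as a plain dict of check name to number. The values shown are made-up placeholders, and `score_deltas` is a hypothetical helper.

```python
def score_deltas(baseline, augmented):
    """Return augmented minus baseline for every check present in both runs."""
    return {name: augmented[name] - baseline[name] for name in baseline if name in augmented}


# Placeholder numbers for illustration only; substitute the mean_scores from your own runs.
baseline = {"avg_turn_taking_latency": 900.0, "result_completed": 1.0}
augmented = {"avg_turn_taking_latency": 1250.0, "result_completed": 0.75}

for check, delta in score_deltas(baseline, augmented).items():
    print(f"{check}: {delta:+.2f}")
```

A positive delta on a latency check and a negative delta on a completion check both point in the direction described in the list above.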

For a structured comparison between baseline and augmented runs, see Experimentation and A/B Testing.

Where to Go Next

Cookbook

Full runnable script: 06_augmentation.py