Skip to main content

Programmatic Red Teaming

Red teaming is the practice of actively trying to make your AI system fail in ways an adversary, a careless user, or a corner-case input might cause it to fail in production. It is the inverse of unit testing: rather than verifying that the happy path works, red teaming verifies that the unhappy path is contained.

For LLM and agent applications, ad hoc red teaming - "let me try a few jailbreaks before launch" - is not enough. The attack surface is too large, the failure modes are too varied, and the same agent's behavior shifts every time you change the model, the system prompt, or a tool. Okareo treats red teaming as a programmatic discipline: a reproducible suite of adversarial scenarios, attacker personas, and judges that runs every time the agent changes and produces an auditable record.

This page describes the discipline. The OWASP testing guide is the canonical implementation; Adversarial Drivers covers the attacker-persona primitive in depth.

Why red teaming has to be programmatic

Three properties of LLM and agent applications make manual red teaming break down:

  1. The output is non-deterministic. A jailbreak that succeeds 5% of the time still ships exploits to 5% of users. You cannot find a 5%-rate failure by hand; you need to run the same probe many times across slight prompt variations and measure a rate.
  2. The agent changes constantly. Every new model, system-prompt edit, tool addition, or RAG corpus update can re-open a closed vulnerability. A red-team report from last quarter does not tell you whether the agent is safe today.
  3. Coverage is non-trivial. OWASP enumerates 10 LLM and 10 agentic risk categories. Each decomposes into multiple scenarios. Each scenario has many possible variants. A reproducible suite is the only way to know what is covered.

Programmatic red teaming addresses all three: scenarios are versioned, runs are repeatable, and pass/fail rates are tracked over time.

The three primitives

Every red-teaming engagement on Okareo is built from three artifacts:

PrimitiveWhat it isWhere it lives
ScenarioA seed input (or set of inputs) representing a specific adversarial intent. "Direct prompt injection," "PII exfiltration probe," "permission escalation.".jsonl files - one row per seed
Adversarial DriverA language-model persona that plays the attacker across a multi-turn conversation. The driver knows the goal, has tactics to escalate, and stays in character when refused..md persona prompts - see Adversarial Drivers
CheckA judge that decides whether the agent's response or transcript represents a pass or a failure. Model-based for behavioral judgments, code-based for deterministic rules (regex, schema, allowlists)..md for model-based, .py for code-based

Holding the three separate is what makes the suite forkable. Add a new scenario without touching drivers. Swap a check without touching scenarios. Reuse a generic red-teamer driver across LLM01, LLM06, LLM07, and LLM10.

Single-turn checks vs multi-turn simulations

The single most important red-teaming decision is whether the failure mode you are testing is stateless or stateful.

Stateless failures show up in a single response. A direct prompt injection that asks the model to ignore its instructions; a probe that asks the model to repeat a credential it saw in context; a malformed input that triggers an unsafe code generation. Test these with single-turn evaluations: one adversarial input, one model response, one check.

Stateful failures only emerge across a conversation. A jailbreak that succeeds on turn 7 because the model accepted a small concession on turn 3. An agent that escalates its own permissions across actions. A system prompt that an attacker reconstructs piece by piece across many polite questions. Test these with multi-turn simulations: the adversarial driver pushes the target agent across 5-10 turns and the check evaluates the full transcript.

The OWASP suite makes the split per category. As a rule of thumb: agentic risks are usually stateful, content risks are usually stateless.

When to red team

StageWhat to runWhy
Pre-launchFull OWASP suite plus domain-specific scenariosEstablish a baseline. Block launch if Critical-severity categories are not passing.
Per changeThe subset relevant to the change. New tool? Run LLM06, ASI02. New RAG corpus? Run LLM08, LLM04.Catch regressions at the place that introduces them.
ContinuousOWASP suite on a schedule (nightly or per-PR)Detect drift from model upgrades, prompt edits, and library changes. Use scheduled simulations.
Post-incidentA new scenario that reproduces the incident, run against the patched agentConvert a real failure into a permanent regression test. The ASI08 trace bridge is built for this.

Engagement shape

A typical Okareo red-teaming engagement on a customer agent has three phases. The same shape works whether you are doing it as an external consultant or running it internally.

1. Scope and threat model

Start by writing down what failure modes matter for this agent. A customer-service chatbot has a different threat model from an autonomous code-modification agent. Output: a one-page threat model that maps domain risks to OWASP categories.

For example, a multi-agent network assurance system might map BGP injection and NETCONF misuse to LLM01 (indirect prompt injection in retrieved configs) and LLM06 (unauthorized tool invocation), and add a domain-specific category covering protocol-level abuse.

2. Adapt the suite

Fork compliance-owasp, point target.json at the agent, and add domain-specific scenarios, drivers, and checks alongside the OWASP defaults. See Customizing for your domain for the file layout.

If the agent has a guardrail layer (a separate model or service that filters inputs/outputs), test the guardrail independently. See Validating Guardrails.

3. Run, report, repeat

Run the suite, produce a report, and wire the suite into CI so that next month's report measures regression rather than starting from scratch.

The report should answer four questions:

  1. Coverage. Which OWASP categories did we test? Which scenarios per category?
  2. Findings. Which scenarios produced failing responses? At what rate? What is the severity?
  3. Recommended remediations. Per finding: prompt change, tool-permission scope, guardrail rule, or model swap.
  4. Reproducibility. A pinned commit of the suite, the target config, and the Okareo run IDs so any reviewer can re-run.
Audit trail

Okareo evaluation runs are durable, timestamped records linked to scenario, check, and target versions. The same record set that closes out a red-teaming engagement also serves as evidence for an audit. The report is a presentation layer over the runs, not a separate artifact you have to keep in sync.

What red teaming is not

Red teaming is not a substitute for:

  • Quality evaluations. Red teaming asks "can this fail under attack?" Quality evaluation asks "does this work for normal users?" Both are required. See Evaluation.
  • Production monitoring. Red teaming runs adversarial inputs against a target you control. Monitoring watches real user traffic. The two should share checks where possible (run the same hallucination check in red-team scenarios and in monitoring), but they are different surfaces. See Monitoring.
  • Guardrail validation. A guardrail and a model are different things. If your agent has a guardrail layer, test it independently with the independent-test pattern.

Where to go next