Skip to main content

Validating Guardrails

A guardrail is the safety layer that sits between users and your model: an input filter, an output filter, or both. It might be a managed service (a moderation API, a policy classifier), a separate small model you maintain, or a rule engine. Whatever it is, the guardrail has two jobs:

  1. Block disallowed traffic. Inputs containing prompt injection, PII, or policy-violating content should never reach the model. Outputs containing leaked credentials, unsafe code, or hallucinated harmful advice should never reach the user.
  2. Allow allowed traffic. Legitimate users should not be blocked.

Both jobs need testing, and they need testing independently of the model. This page describes the independent-test pattern: how to validate a guardrail as a unit, before it is wired into the agent it protects.

Why test the guardrail independently

The instinct is to test the agent end-to-end. Send adversarial inputs, measure how often the system as a whole produces a violation, ship if the rate is low enough.

This is a trap, for two reasons:

  1. You cannot diagnose failures. When an end-to-end test fails, you do not know whether the model leaked, the input guardrail missed the attack, or the output guardrail failed to scrub the leaked content. All three need different fixes. Without per-layer evaluations, every regression turns into a multi-day investigation.
  2. You cannot measure false positives. The guardrail's job is also to not block legitimate traffic. End-to-end tests measure whether bad inputs got through; they do not measure whether good inputs got blocked. A guardrail with a 99% block rate and a 30% false-positive rate is unshippable, but the end-to-end suite will not tell you that.

The independent-test pattern fixes both. Test the guardrail as a function: input goes in, allow/block decision comes out, the test scores the decision against an expected label.

The pattern

Three datasets, three metrics, one Custom Check.

Three datasets

DatasetContainsExpected label
AdversarialInputs (or outputs) that should be blocked. Prompt injections, PII probes, jailbreak openings, policy violations. Often sourced from compliance-owasp scenarios.block
BenignLegitimate, in-distribution inputs. Real user messages, support queries, normal task requests.allow
BorderlineEdge cases that humans disagree on. Aggressive but legal language, sensitive but allowed topics, unusual formatting.Whatever your written policy says, with a reviewer-confirmed label.

The benign and borderline sets are the ones teams usually skip. Don't. Most guardrail failures in production are false positives, not bypasses.

Three metrics

MetricFormulaTarget
Block rate (true positive rate)blocks_on_adversarial / total_adversarialHigh. >95% for Critical-severity categories.
False-positive rateblocks_on_benign / total_benignLow. Define a hard ceiling (e.g. <2%) before measuring.
Bypass rate1 - block_rateLow. Equivalent to block rate; track separately so the framing is "what fraction of attacks succeed."

Track all three over time. A guardrail that improves block rate from 92% to 96% but increases false positives from 1% to 8% is a regression, not an improvement.

One target, one check

Treat the guardrail as a Target, exactly like you would treat the model. Its endpoint takes an input, returns a decision (allow or block, sometimes with a reason or a confidence score). Run the three datasets through it as separate evaluations:

# Pseudocode - exact API depends on your guardrail's interface
guardrail_target = okareo.create_endpoint_target({
"endpoint_url": "https://your-guardrail/classify",
"request_body": {"input": "{scenario_input}"},
"response_path": "decision", # returns "allow" or "block"
})

okareo.run_test(
target=guardrail_target,
scenario=adversarial_scenario,
checks=["guardrail_decision_match"], # passes if decision == expected
)

The check is trivial: compare the guardrail's decision to the expected label. The work is in the datasets. See Custom Checks for how to write the check itself.

Connecting guardrail tests to red teaming

The output of a red-teaming engagement is a list of attacks that worked. The output of a guardrail test is a list of attacks the guardrail caught vs missed. These should feed each other:

  • Every red-team finding becomes a guardrail test case. If the agent leaked PII on turn 4 of a multi-turn simulation, the message that triggered the leak goes into the adversarial dataset for the input guardrail, and the leaked response goes into the adversarial dataset for the output guardrail.
  • Every benign-traffic false positive becomes a regression test. If a real user message got blocked, add it to the benign dataset before fixing the guardrail. This prevents the fix from re-introducing the bypass.

The closed loop turns a red-teaming engagement from a one-time report into a permanent test surface that grows with every incident.

Same scenarios, different target

The OWASP scenarios in compliance-owasp are written against the agent. To repurpose them for guardrail testing, point your guardrail at them as the Target instead of the agent. The scenarios already carry the right adversarial intent; only the expected outcome changes (the guardrail should block, the agent should refuse).

Input guardrails vs output guardrails

The pattern is the same for both, but the datasets are different:

LayerAdversarial datasetBenign datasetWhat you are catching
Input guardrailPrompt injections, jailbreak openings, PII probes, malicious tool-call requestsReal user messages, support queries, in-domain task requestsAttacks before they reach the model
Output guardrailModel responses that contain leaked credentials, unsafe code, hallucinated harmful advice, schema violationsReal model responses to legitimate inputsFailures the model produces, before they reach the user

A common mistake is to use input-shaped data to test an output guardrail. Output guardrails see model responses, not user messages, so the dataset has to be model responses. The fastest source is to capture real model output during red-team simulations and label it.

What "shippable" looks like

Define block rate and false-positive rate ceilings in writing before you measure. Without a written target, you will rationalize whatever number you get.

Reasonable starting targets:

  • Critical-severity categories (LLM01 prompt injection, LLM02 PII leakage, LLM06 excessive agency): block rate >95%, false-positive rate <2%.
  • High-severity categories (LLM05 output handling, LLM07 system prompt leakage): block rate >90%, false-positive rate <3%.
  • Medium-severity categories (LLM09 misinformation, LLM10 unbounded consumption): block rate >80%, false-positive rate <5%.

These are starting points, not standards. Calibrate to your domain: a healthcare guardrail can tolerate much lower false-positive rates than a developer-tooling one.

Continuous guardrail validation

A guardrail's failure modes drift over time as attackers find new patterns and as the upstream provider updates the underlying model. Run guardrail validation on a schedule, not just at launch.

A practical setup:

  1. Per-PR: run the full adversarial + benign + borderline suite against the staging guardrail. Block merge if block rate or false-positive rate regresses past defined thresholds.
  2. Nightly: run the same suite against production. Page on regression.
  3. On-incident: add the failing case to the dataset, re-run, confirm fix, merge.

See Scheduled Simulations for the scheduling primitive and Monitoring for production-traffic monitoring of the same checks.

Where to go next