Researchers Nikita Kezins, Urbas Ekka, Pascal Berrang, and Luca Arnaboldi formally proved that every guardrail classifier they tested contains verifiable safety holes. The findings, published May 11 on arXiv, expose a structural weakness in how enterprises audit production LLM safeguards.
The work targets the gap between empirical red-teaming and formal safety certification. Current guardrail classifiers—filters that sit between user prompts and model responses to block harmful content—post strong benchmark numbers but carry no provable guarantees. The root cause is a mismatch between verification frameworks borrowed from other ML domains and the nature of language: the standard epsilon-ball robustness property, defined over continuous inputs such as image pixels, carries no semantic meaning in discrete token space, so "the neighborhood of a harmful prompt" is mathematically undefined.
The team's fix shifts verification away from token space entirely. Instead, they verify in the classifier's pre-activation space—the internal embedding layer where semantically similar prompts cluster together. A "harmful region" becomes a convex shape (either a hyper-rectangle or a Gaussian mixture component) enclosing the representations of known harmful prompts. Because the final sigmoid head is monotonic, certifying the worst-case point in that convex region certifies every point inside it. This produces a closed-form soundness check that runs in O(d) time, where d is the embedding dimension, with no approximation required in the hyper-rectangle case.
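To make the certification step concrete, here is a minimal sketch, assuming a guardrail whose scoring head is a single linear layer followed by a sigmoid over the pre-activation embedding, and a harmful region given as an axis-aligned box in that space. The function name `certify_box`, the toy dimensions, and the threshold are illustrative assumptions, not details from the paper.

```python
import numpy as np

def certify_box(w, b, lo, hi, threshold=0.5):
    """Check whether sigmoid(w @ x + b) >= threshold for EVERY x in the
    axis-aligned box lo <= x <= hi (elementwise), i.e. whether the
    classifier provably blocks the whole region.

    Because the sigmoid is monotonic, the minimum score over the box is
    reached at the corner that minimizes the logit: take lo where the
    weight is positive and hi where it is not. The whole check is O(d).
    """
    worst_x = np.where(w > 0, lo, hi)            # corner minimizing the logit
    worst_logit = float(w @ worst_x + b)
    worst_score = 1.0 / (1.0 + np.exp(-worst_logit))
    return worst_score >= threshold, worst_x

# Toy 4-dimensional example: a box enclosing embeddings of known harmful prompts.
w = np.array([1.2, -0.4, 0.8, 0.1]); b = -0.3
lo = np.array([0.5, -1.0, 0.2, 0.0]); hi = np.array([1.5, 0.0, 1.0, 0.5])
blocked, counterexample = certify_box(w, b, lo, hi)
```

When the check fails, `counterexample` is a concrete embedding inside the harmful region that the classifier would let through—exactly the SAT outcome described below.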
Two certificate types were instantiated. SVD-aligned hyper-rectangles yield exact SAT/UNSAT answers: the classifier either is guaranteed to block the entire harmful region or it is not. Gaussian Mixture Models (GMMs) yield probabilistic certificates over semantically coherent clusters, capturing the fluid boundaries of real-world harmful language. Both were applied to three author-trained guardrail classifiers benchmarked on a toxicity detection task.
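The GMM certificate can be read as a coverage number: the share of one component's probability mass, a semantically coherent cluster of harmful prompts in embedding space, that the classifier blocks at a given threshold. The sketch below does not reproduce the paper's certificate construction; it only illustrates that quantity with a plain Monte Carlo estimate, reusing the assumed linear-plus-sigmoid head and a hypothetical `coverage_estimate` helper.

```python
import numpy as np

def coverage_estimate(w, b, mean, cov, threshold=0.5, n_samples=100_000, seed=0):
    """Monte Carlo estimate of the fraction of one Gaussian component
    (a cluster of harmful-prompt embeddings) that the classifier flags
    as harmful at the given threshold."""
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(mean, cov, size=n_samples)   # draw embeddings
    scores = 1.0 / (1.0 + np.exp(-(samples @ w + b)))              # sigmoid scores
    return float((scores >= threshold).mean())
```

Sweeping `threshold` and re-running the estimate traces the coverage-versus-precision trade-off that underlies the divergence reported below.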
The results are stark. Every hyper-rectangle configuration returned SAT—meaning every classifier has at least one provably unsafe region it fails to block, regardless of benchmark performance. GMM probabilistic certificates revealed architectural divergence. GPT-2-based guardrails maintained 90% harmful-region coverage across varying boundary thresholds; Llama-3.1-8B held 80%. BERT collapsed. At the optimal classification threshold, BERT's coverage dropped to 55%—a phenomenon the authors call "coverage collapse." BERT reaches full coverage only by adopting an extremely conservative threshold that would cripple production precision.
For enterprise teams, the compliance implications are direct. Red-team evaluations, penetration tests, and benchmark suites measure empirical failure rates; they cannot rule out failure modes outside their test distribution. Formal certificates can. Both the EU AI Act and the NIST AI Risk Management Framework signal movement toward auditable safety properties for high-risk deployments, a documentation bar that red-teaming alone cannot meet. This framework gives compliance and governance teams a concrete artifact: a certificate stating whether a given classifier is verifiably safe over a defined harmful embedding region, not just over a finite test set.
Open questions remain. The framework currently covers classifiers with a sigmoid head; extending it to multi-label or softmax architectures is unaddressed. Harmful regions are bounded by the representations of known prompts, so the certificates say nothing about novel jailbreaks whose embeddings fall outside those regions. And the three classifiers were purpose-trained for this study rather than drawn from deployed commercial safeguards such as Llama Guard or OpenAI's moderation API, leaving the transferability of the findings uncertain.
The field now has a mathematically grounded alternative to empirical red-teaming. Enterprises evaluating guardrail vendors should ask whether formal verification artifacts are on the product roadmap.