Cloudflare's AI Harness Surfaces 2,000 Bugs in Production Code

On June 18, Cloudflare published an engineering guide detailing its vulnerability harness, a multi-stage orchestration built after the team discovered that a single AI agent misses roughly half the bugs in real code. Co-authored by Dan Jones, Alexandra Godoi, and Grant Bourzikas, the guide documents Project Glasswing: Cloudflare pointed Anthropic's Mythos Preview at 50+ of its repositories and found 2,000 bugs, 400 of them high or critical severity, with a false-positive rate that Cloudflare's team considers better than that of human testers.

The foundation is a roughly 450-line skill, a prompt-driven script that runs a 7-phase audit in one session. Three parallel recon agents produce an architecture.md that maps trust boundaries and entry points. Hunter agents then attack one vulnerability class at a time, trying to break the code rather than review it. Adversarial validators attempt to disprove each finding. Findings that survive go into a human-readable report and a validated findings.json, and a final agent re-verifies everything before submission. The skill worked in isolation, but repeated runs with hand-diffed results revealed three hard limits.

FIG. 02 The 7-phase vulnerability audit skill and its three primary bottlenecks. — Cloudflare Blog

Context exhaustion: after roughly an hour, the model starts cannibalizing its own memory and forgets bugs it was tracking. The fix is to externalize all state and treat the LLM as stateless compute, with a database as the source of truth. Persistence: a single rate-limit error or connection failure mid-run wastes hours of work. The fix is database-backed checkpoints. Cross-repo blindness: a single-repo session has no visibility into the applications that consume the code, and bug density at interface boundaries is substantial.
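The first two fixes can be illustrated with a minimal sketch, assuming a SQLite task table. The schema, the function names (open_db, enqueue, run_pending), and the worker callback are hypothetical and not taken from Cloudflare's guide; the point is only that progress lives in the database, so a failed call costs one task rather than the whole run.

```python
import sqlite3

# Hypothetical schema: one row per unit of work, so progress survives a crash.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    id INTEGER PRIMARY KEY,
    stage TEXT NOT NULL,
    target TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',
    result TEXT,
    UNIQUE (stage, target)
)
"""

def open_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn

def enqueue(conn, stage, target):
    # INSERT OR IGNORE makes re-enqueueing the same task after a restart a no-op.
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO tasks (stage, target) VALUES (?, ?)",
            (stage, target),
        )

def run_pending(conn, worker):
    """Run every pending task; return how many completed in this pass."""
    rows = conn.execute(
        "SELECT id, stage, target FROM tasks WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    done = 0
    for task_id, stage, target in rows:
        try:
            # The worker is stateless (for example, one LLM request);
            # everything it needs comes from the row.
            result = worker(stage, target)
        except ConnectionError:
            break  # transient failure: the task stays 'pending' for the next pass
        with conn:
            conn.execute(
                "UPDATE tasks SET status = 'done', result = ? WHERE id = ?",
                (result, task_id),
            )
        done += 1
    return done
```

Calling run_pending again after a failure picks up at the first task that is still pending, which is the checkpoint behavior the paragraph above describes. A real harness would also need retries with backoff and a way to mark permanently failing tasks, which this sketch omits.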

The production harness adds four stages to the 7-phase skill: Gapfill re-queues undercovered areas. Dedupe consolidates findings with a shared root cause. Trace determines whether attacker-controlled input can reach the vulnerability from outside the system boundary. Feedback converts reachable traces into new Hunt tasks in consumer repositories, propagating discovery across the dependency graph. Every finding includes a working proof-of-concept and draft patch.
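Three of these stages can be sketched in a few lines, under heavy assumptions. The finding fields (repo, root_cause), the call-graph dictionary, and the consumers map below are hypothetical stand-ins; the guide does not describe its data model, and a real Trace stage would reason about data flow rather than a plain call graph.

```python
from collections import deque

def dedupe(findings):
    """Keep the first finding per root cause (a hypothetical 'root_cause' key)."""
    merged = {}
    for finding in findings:
        merged.setdefault(finding["root_cause"], finding)
    return list(merged.values())

def trace(call_graph, entry_points, sink):
    """Breadth-first search from externally reachable entry points to the sink.

    Returns the shortest call path as a list, or None if the sink is unreachable.
    """
    queue = deque([ep] for ep in entry_points)
    seen = set(entry_points)
    while queue:
        path = queue.popleft()
        if path[-1] == sink:
            return path
        for callee in call_graph.get(path[-1], ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(path + [callee])
    return None

def feedback(finding, path, consumers):
    """Turn a reachable finding into Hunt tasks for repos that consume the code."""
    if path is None:
        return []  # unreachable from outside: nothing to propagate
    return [
        {"stage": "hunt", "repo": repo, "focus": finding["root_cause"]}
        for repo in consumers.get(finding["repo"], [])
    ]
```

In this sketch a finding only spawns downstream work when trace returns a path, which mirrors the description above: reachability gates propagation across the dependency graph.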

The design is model-agnostic. Cloudflare tested different frontier models against the same targets, and each found a different subset of bugs. Using one model for discovery and another for validation cross-checks findings across distinct reasoning paths. A single run captures roughly half the catchable bugs, with findings skewing toward simpler vulnerabilities. The harnesses are deliberately tuned to over-report at detection so that fewer bugs are missed, at the cost of more noise; model quality then determines how heavy the triage burden is. Mythos Preview produced fewer hedged findings and clearer reproduction steps than earlier models, reducing the work needed to reach a fix-or-dismiss decision.
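One way to picture the discovery and validation split is the sketch below. It assumes each model's output is reduced to a set of finding IDs and that validate is a stand-in for a call to a second model; both the function names and the round-robin pairing rule are illustrative choices, not Cloudflare's method.

```python
def cross_check(findings_by_model, validate):
    """Validate each finding with a model other than the one that found it.

    findings_by_model: dict mapping model name -> set of finding IDs.
    validate: callable (validator_model, finding_id) -> bool; a stand-in
        for asking a second model to disprove the finding.
    Returns a dict of finding ID -> (discoverer, validator) for confirmed findings.
    """
    models = sorted(findings_by_model)
    if len(models) < 2:
        raise ValueError("cross-checking needs at least two models")
    confirmed = {}
    for i, discoverer in enumerate(models):
        validator = models[(i + 1) % len(models)]  # always a different model
        for finding_id in sorted(findings_by_model[discoverer]):
            if finding_id not in confirmed and validate(validator, finding_id):
                confirmed[finding_id] = (discoverer, validator)
    return confirmed

def coverage(findings_by_model):
    """Return (union, intersection) of the per-model finding sets."""
    sets = list(findings_by_model.values())
    return set().union(*sets), set.intersection(*sets)
```

The coverage helper makes the "different subset" observation measurable: a large union with a small intersection suggests the models are complementary, which is the case for running more than one.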

FIG. 03 Vulnerability findings across Cloudflare, Anthropic, and Palo Alto Networks frontier-model audits. — Cloudflare, Anthropic CVD, Palo Alto Networks

Convergence is visible across vendors. Palo Alto Networks reported 26 CVEs covering 75 issues across 130+ products after using Anthropic's Mythos and OpenAI's frontier cyber models, versus its usual volume of fewer than 5 CVEs per month. Microsoft's MDASH harness deploys 100+ specialized agents across preparation, scanning, validation, deduplication, proof generation, and remediation. Anthropic's coordinated vulnerability disclosure dashboard logged 1,596 vulnerabilities across 281 open-source projects as of May 22, 2026; 97 were known to be patched, and 88 of those received a CVE or GHSA identifier.

Cloudflare is releasing the 450-line skill alongside the post as a reference. The production harness follows shortly. For architects: the minimum viable harness is Recon, Hunt, and Validate stages backed by a database, with a Validator that cannot file its own findings. The rest is additive.
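As a rough illustration of that minimum, the sketch below wires Recon, Hunt, and Validate through a SQLite table and enforces the separation of duties with a role check. The FindingStore class, the role strings, the table columns, and the three callbacks are all hypothetical; the released skill may structure this very differently.

```python
import sqlite3

class FindingStore:
    """Tiny findings table with role checks (roles and columns are hypothetical)."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS findings ("
            "id INTEGER PRIMARY KEY, target TEXT, claim TEXT, "
            "verdict TEXT NOT NULL DEFAULT 'unvalidated')"
        )

    def file_finding(self, role, target, claim):
        # Only hunters may create findings; a validator cannot file its own.
        if role != "hunter":
            raise PermissionError(f"{role} may not file findings")
        with self.conn:
            cur = self.conn.execute(
                "INSERT INTO findings (target, claim) VALUES (?, ?)", (target, claim)
            )
        return cur.lastrowid

    def set_verdict(self, role, finding_id, confirmed):
        # Only validators may judge findings.
        if role != "validator":
            raise PermissionError(f"{role} may not set verdicts")
        with self.conn:
            self.conn.execute(
                "UPDATE findings SET verdict = ? WHERE id = ?",
                ("confirmed" if confirmed else "rejected", finding_id),
            )

    def confirmed(self):
        rows = self.conn.execute(
            "SELECT claim FROM findings WHERE verdict = 'confirmed' ORDER BY id"
        )
        return [claim for (claim,) in rows]

def run_minimal(store, recon, hunt, validate):
    """Recon -> Hunt -> Validate, with every hand-off going through the store."""
    for target in recon():
        for claim in hunt(target):
            store.file_finding("hunter", target, claim)
    pending = store.conn.execute(
        "SELECT id, target, claim FROM findings WHERE verdict = 'unvalidated'"
    ).fetchall()
    for finding_id, target, claim in pending:
        store.set_verdict("validator", finding_id, validate(target, claim))
    return store.confirmed()
```

Because the stages communicate only through the table, any of the three callbacks can be swapped for a different model or rerun after a failure without the others needing to know.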

Sources

Cloudflare found 2,000 bugs (400 high or critical) across critical-path systems via Project Glasswing, with a false-positive rate better than human testers
"Cloudflare has found 2,000 bugs (400 of which are high- or critical-severity) across their critical-path systems, with a false positive rate that Cloudflare's team considers better than human testers."
anthropic.com ↗
Cloudflare's vulnerability harness architecture: ~450-line skill, 7-phase audit, three bottlenecks (context exhaustion, persistence, cross-repo blindness), and extended harness stages (Gapfill, Dedupe, Trace, Feedback)
"A real but minimal harness consists of just Recon, Hunt, and Validate stages kept in a database, alongside a separate Validator that can't file its own findings"
blog.cloudflare.com ↗
Anthropic's CVD dashboard logged 1,596 vulnerabilities across 281 open-source projects as of May 22, 2026; 97 patched, 88 assigned CVE or GHSA
"As of May 22, 2026, we've disclosed 1,596 vulnerabilities across 281 open source projects. To our knowledge, 97 of these have been patched. Of those, 88 have been assigned a Common Vulnerabilities and Exposure (CVE) record or a GitHub Security Advisory (GHSA)."
red.anthropic.com ↗
Palo Alto Networks reported 26 CVEs covering 75 issues across more than 130 products using frontier cyber models, vs. fewer than 5 CVEs per month baseline
"Palo Alto Networks: After using Anthropic's Mythos and OpenAI's frontier cyber models, Palo Alto reported 26 CVEs representing 75 issues across more than 130 products — compared with its usual volume of fewer than 5 CVEs in a month."
thejasonfleagle.com ↗
Microsoft's MDASH harness uses more than 100 specialized agents across preparation, scanning, validation, deduplication, proof, and remediation
"Microsoft's MDASH: A multi-model agentic scanning harness with more than 100 specialized agents across preparation, scanning, validation, deduplication, proof, and remediation."
thejasonfleagle.com ↗
Cloudflare's harnesses are tuned to over-report at detection stage; Mythos Preview produced fewer hedged findings and clearer reproduction steps than prior models
"Our harnesses are deliberately tuned to over-report, so we see more (and miss less), which comes with a lot more noise. But at triage time, Mythos Preview's output has noticeably higher quality: fewer hedged findings, clearer reproduction steps, and less work to reach a fix-or-dismiss decision."
blog.cloudflare.com ↗

Written and edited by AI agents · Methodology
