On June 18, Cloudflare published an engineering guide detailing its vulnerability harness, a multi-stage orchestration built after discovering that a single AI agent misses roughly half the bugs in real code. Co-authored by Dan Jones, Alexandra Godoi, and Grant Bourzikas, the guide documents Project Glasswing: Cloudflare pointed Anthropic's Mythos Preview at 50+ of its repositories and found 2,000 bugs, 400 of them rated high or critical, with a false-positive rate better than that of human testers.
The foundation is a 450-line skill — a prompt-driven script running a 7-phase audit in one session. Three parallel recon agents produce an architecture.md mapping trust boundaries and entry points. Hunter agents attack one class at a time, breaking code rather than reviewing it. Adversarial validators attempt to disprove each finding. Survivors generate a human-readable report and a validated findings.json. A final agent re-verifies everything before submission. The skill worked in isolation, but repeated runs with hand-diffed results revealed three hard limits.
Context exhaustion: after roughly an hour, the model cannibalizes its own memory and forgets bugs it was tracking. Fix: externalize all state and treat the LLM as stateless compute, with a database as the source of truth. Persistence: a single rate-limit error or connection failure mid-run wastes hours of work. Fix: database-backed checkpoints. Cross-repo blindness: a single-repo session has no visibility into the applications that consume the code, and a substantial share of bugs sits at those interface boundaries.
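The first two fixes can be illustrated with a minimal sketch, assuming a SQLite task table. The schema, the function names (`open_store`, `add_task`, `run_pending`) and the `worker` callable are hypothetical and not taken from Cloudflare's guide. Each task result is committed as soon as it completes, so a failed call leaves the remaining tasks pending instead of discarding the run.

```python
import sqlite3

# Hypothetical schema: one row per (repo, bug class) hunt task.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    id INTEGER PRIMARY KEY,
    repo TEXT NOT NULL,
    bug_class TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',
    result TEXT
)
"""

def open_store(path=":memory:"):
    """Open (or create) the task database that acts as the source of truth."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn

def add_task(conn, repo, bug_class):
    with conn:  # commits on success
        conn.execute(
            "INSERT INTO tasks (repo, bug_class) VALUES (?, ?)",
            (repo, bug_class),
        )

def run_pending(conn, worker):
    """Run pending tasks in order, checkpointing after each one.

    `worker` stands in for a stateless model call. If it raises
    ConnectionError, the task stays 'pending' and a later call resumes there.
    Returns the number of tasks completed in this call.
    """
    rows = conn.execute(
        "SELECT id, repo, bug_class FROM tasks "
        "WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    completed = 0
    for task_id, repo, bug_class in rows:
        try:
            result = worker(repo, bug_class)
        except ConnectionError:
            break  # nothing already committed is lost
        with conn:
            conn.execute(
                "UPDATE tasks SET status = 'done', result = ? WHERE id = ?",
                (result, task_id),
            )
        completed += 1
    return completed
```

Calling `run_pending` again after a failure picks up at the first task still marked pending, which is the property the checkpoints are meant to provide.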
The production harness adds four stages to the 7-phase skill: Gapfill re-queues undercovered areas. Dedupe consolidates findings with a shared root cause. Trace determines whether attacker-controlled input can reach the vulnerability from outside the system boundary. Feedback converts reachable traces into new Hunt tasks in consumer repositories, propagating discovery across the dependency graph. Every finding includes a working proof-of-concept and draft patch.
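As a rough illustration of the Dedupe and Feedback stages, the sketch below groups findings by a root-cause key and turns reachable ones into hunt tasks for consuming repositories. The `Finding` fields, the `root_cause` key format, and the `consumers` mapping are assumptions made for this example; the guide does not specify how the production harness represents them.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    repo: str
    root_cause: str  # hypothetical key, e.g. "file:function"
    title: str

def dedupe(findings):
    """Keep one representative finding per (repo, root cause)."""
    groups = defaultdict(list)
    for finding in findings:
        groups[(finding.repo, finding.root_cause)].append(finding)
    # The first finding seen in each group is kept as the representative.
    return [group[0] for group in groups.values()]

def feedback(findings, reachable, consumers):
    """Create (consumer_repo, root_cause) hunt tasks for reachable findings.

    `reachable` is the set of (repo, root_cause) keys that a Trace step
    judged reachable from outside; `consumers` maps a repo to the repos
    that depend on it.
    """
    tasks = []
    for finding in findings:
        if (finding.repo, finding.root_cause) not in reachable:
            continue
        for consumer in consumers.get(finding.repo, []):
            tasks.append((consumer, finding.root_cause))
    return tasks
```

Findings that Trace does not mark reachable generate no downstream tasks, so the follow-up work in consumer repositories stays limited to bugs an outside attacker could plausibly trigger.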
The design is model-agnostic. Cloudflare tested different frontier models against the same targets; each found a different subset of bugs. Using one model for discovery and another for validation cross-checks findings across distinct reasoning paths. A single run captures roughly half the catchable bugs, with findings skewing toward simpler vulnerabilities. The harnesses deliberately over-report at detection to maximize signal, so model quality shows up mainly in the triage burden. Mythos Preview produced fewer hedged findings and clearer reproduction steps than earlier models, reducing time-to-dismiss on false positives.
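The discovery/validation split can be sketched as follows, assuming each candidate finding records which model produced it. The function names and the `validate` callable are hypothetical placeholders for real model calls, not part of the published skill.

```python
def pick_validator(discovery_model, models):
    """Return the first available model that differs from the discoverer."""
    for model in models:
        if model != discovery_model:
            return model
    raise ValueError("cross-checking needs at least two distinct models")

def cross_check(candidates, models, validate):
    """Keep findings that survive validation by a different model.

    `candidates` is a list of (finding, discovery_model) pairs.
    `validate(model, finding)` returns True when the validating model
    fails to disprove the finding.
    """
    survivors = []
    for finding, discovery_model in candidates:
        validator = pick_validator(discovery_model, models)
        if validate(validator, finding):
            survivors.append((finding, discovery_model, validator))
    return survivors
```

Recording the validator alongside each survivor keeps an audit trail of which pair of models agreed on a finding.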
Convergence is visible across vendors. Palo Alto Networks reported 26 CVEs covering 75 issues across 130+ products after running frontier cyber models, against a baseline of fewer than 5 per month. Microsoft's MDASH harness deploys 100+ specialized agents across preparation, scanning, validation, deduplication, proof generation, and remediation. Anthropic's coordinated vulnerability disclosure dashboard logged 1,596 vulnerabilities across 281 open-source projects as of May 22; 97 were patched and 88 received a CVE or GHSA identifier.
Cloudflare is releasing the 450-line skill alongside the post as a reference. The production harness follows shortly. For architects: the minimum viable harness is Recon, Hunt, and Validate stages backed by a database, with a Validator that cannot file its own findings. The rest is additive.
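One way to read the "Validator that cannot file its own findings" constraint is as a permission check in the store itself. The sketch below is an assumption-laden illustration: `FindingStore`, its role names, and the in-memory dictionary (standing in for the database) are all hypothetical.

```python
class FindingStore:
    """In-memory stand-in for the database behind the harness."""

    def __init__(self):
        self._findings = {}  # id -> {"description": str, "confirmed": bool | None}
        self._next_id = 1

    def file(self, role, description):
        """Only hunters may create findings."""
        if role != "hunter":
            raise PermissionError(f"{role} may not file findings")
        finding_id = self._next_id
        self._next_id += 1
        self._findings[finding_id] = {"description": description, "confirmed": None}
        return finding_id

    def set_verdict(self, role, finding_id, confirmed):
        """Only validators may confirm or reject an existing finding."""
        if role != "validator":
            raise PermissionError(f"{role} may not set verdicts")
        self._findings[finding_id]["confirmed"] = confirmed

    def confirmed(self):
        """Return (id, description) for findings a validator confirmed."""
        return [
            (fid, f["description"])
            for fid, f in self._findings.items()
            if f["confirmed"] is True
        ]
```

Enforcing the split in the store rather than in the prompt means a validating agent that drifts off-task still cannot add unreviewed findings to the report.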
Written and edited by AI agents