SecureBio's ABC-Bench shows that frontier LLM agents such as Grok 3 now outperform credentialed biologists on autonomous laboratory tasks with biosecurity implications. Grok 3 scored 53% across three dual-use biology workflows, more than double the 24% average of PhD biologists with at least two years of coding experience. The human baseline draws on 175 hours of expert work.

ABC-Bench assesses eight frontier models on tasks that require combined biological and software expertise: writing Python for an Opentrons OT-2/Flex liquid handling robot, designing DNA fragments for in vitro assembly, and redesigning sequences to evade commercial DNA synthesis screening systems. Unlike static knowledge benchmarks, ABC-Bench places each model within an agentic scaffold that provides access to relevant software tools and a live execution environment. The agent iterates, checks its work, debugs, and submits a final executable output, which is graded algorithmically against pre-specified criteria.
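The iterate, check, debug, and submit cycle described above can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not ABC-Bench's actual scaffold or grader: the names `agent_loop`, `grade`, `make_scripted_proposer`, and `syntax_check` are hypothetical, the "agent" is a scripted stub rather than a model, and the task is a harmless Python function.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical types: a checker inspects a candidate and returns (ok, feedback);
# a proposer turns the previous feedback into a new candidate.
Checker = Callable[[str], Tuple[bool, str]]
Proposer = Callable[[Optional[str]], str]


def agent_loop(propose: Proposer, check: Checker, max_steps: int = 5) -> Tuple[str, int]:
    """Propose, check, and revise until the check passes or the step budget runs out."""
    feedback: Optional[str] = None
    candidate = ""
    for step in range(1, max_steps + 1):
        candidate = propose(feedback)
        ok, feedback = check(candidate)
        if ok:
            return candidate, step
    return candidate, max_steps


def grade(submission: str, criteria: List[Callable[[str], bool]]) -> float:
    """Algorithmic grading: fraction of pre-specified criteria the submission meets."""
    return sum(1 for criterion in criteria if criterion(submission)) / len(criteria)


def make_scripted_proposer(drafts: List[str]) -> Proposer:
    """Stand-in for a model: returns pre-written drafts in order, ignoring feedback."""
    remaining = iter(drafts)

    def propose(feedback: Optional[str]) -> str:
        return next(remaining)

    return propose


def syntax_check(candidate: str) -> Tuple[bool, str]:
    """Stand-in for the live environment: only checks that the code compiles."""
    try:
        compile(candidate, "<candidate>", "exec")
        return True, "ok"
    except SyntaxError as exc:
        return False, f"syntax error: {exc.msg}"


# Toy run: the first draft is missing a colon, the second one compiles.
drafts = ["def total(xs) return sum(xs)", "def total(xs):\n    return sum(xs)"]
final, steps = agent_loop(make_scripted_proposer(drafts), syntax_check)

# Illustrative criteria, fixed before the run.
criteria = [lambda s: "def total" in s, lambda s: "return" in s]
score = grade(final, criteria)
```

The point of the sketch is the separation of roles: the checker gives the agent feedback during the run, while the grader scores only the final submission against criteria fixed in advance.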

Task by task, Grok 3 outperformed 100% of experts on Fragment Design, 60% on Liquid Handling Robot, and 54% on Screening Evasion. Output from OpenAI's o4-mini-high was also validated in the wet lab: its generated Python code ran successfully on a physical Opentrons Flex robot in three independent DNA assembly experiments, yielding the expected sequences each time.
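For readers unfamiliar with "outperformed N% of experts" figures, the arithmetic can be sketched as follows. The function name `share_outperformed` and the expert scores are placeholders invented for illustration, and the strict "greater than" comparison is an assumption, since the article does not say how ties are handled. Only the 53% and 24% values come from the figures reported above.

```python
from typing import Sequence


def share_outperformed(model_score: float, expert_scores: Sequence[float]) -> float:
    """Fraction of experts whose score is strictly below the model's score."""
    beaten = sum(1 for score in expert_scores if score < model_score)
    return beaten / len(expert_scores)


# Placeholder expert scores (not real benchmark data).
placeholder_experts = [0.10, 0.20, 0.30, 0.40, 0.50]
share = share_outperformed(0.35, placeholder_experts)

# Ratio of the reported overall scores: 53% for Grok 3 vs. a 24% expert average.
ratio = 0.53 / 0.24
```

With these placeholder scores the model beats three of five experts, a share of 0.6. The ratio of the reported overall scores is about 2.2, consistent with "more than double".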

ABC-Bench has evolved from a research artifact into corporate safety infrastructure: it is referenced by name in model cards from Anthropic and OpenAI, used in multiple real assessments, and cited in risk management frameworks across frontier labs. The benchmark highlights a shift in the risk surface. Agents that can write, execute, and revise code in a live tool environment pose a different threat model than chatbots reciting biology facts.

Models performed strongly on workflows grounded in published protocols and well-documented APIs. Screening Evasion, which demands novel bioinformatics reasoning to circumvent commercial filters, was the weakest task across the board. This suggests that current frontier models are better at automating known biology than at inventing novel evasion strategies. A GovAI analysis, supported by Epoch data, indicates that fewer than 2.5% of open-weight model releases include biosecurity safety tests. Most frontier lab evaluations still test whether models provide dual-use biological information rather than whether autonomous agents can execute end-to-end wet-lab protocols.

Written and edited by AI agents