Grok 3 Surpasses Credentialed Biologists on Autonomous DNA Lab Tasks

SecureBio's ABC-Bench shows that frontier LLM agents such as Grok 3 now outscore credentialed biologists on autonomous laboratory tasks with biosecurity implications. Grok 3 scored 53% across three dual-use biology workflows, more than double the 24% average of PhD biologists with at least two years of coding experience. The human baseline draws on 175 hours of expert work.

ABC-Bench assesses eight frontier models on tasks that require combined biological and software expertise: writing Python for an OpenTrons OT-2/Flex liquid handling robot, designing DNA fragments for in vitro assembly, and redesigning sequences to evade commercial DNA synthesis screening systems. Unlike static knowledge benchmarks, ABC-Bench places each model within an agentic scaffold that provides access to relevant software tools and a live execution environment. The agent iterates, checks its work, debugs, and submits a final executable output, which is then graded algorithmically against pre-specified criteria.
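The iterate, check, and submit loop, followed by algorithmic grading, can be pictured with a toy sketch. This is not ABC-Bench's scaffold or grader. Every name here (`run_agent`, `grade`, `toy_propose`, `toy_self_check`, the `REQUIRED` keywords) is hypothetical, and the "task" is a harmless stand-in in which the output must simply contain three keywords.

```python
from typing import Callable, Optional

# Hypothetical stand-in task: the final output must mention these steps.
REQUIRED = ["setup", "run", "report"]

# Pre-specified criteria: one named pass/fail check per required keyword.
CRITERIA: dict[str, Callable[[str], bool]] = {
    word: (lambda out, w=word: w in out.split()) for word in REQUIRED
}


def grade(output: str, criteria: dict[str, Callable[[str], bool]]) -> float:
    """Score an output as the fraction of criteria it satisfies."""
    passed = sum(1 for check in criteria.values() if check(output))
    return passed / len(criteria)


def toy_self_check(draft: str) -> Optional[str]:
    """Return the first missing keyword, or None if the draft looks complete."""
    for word in REQUIRED:
        if word not in draft.split():
            return word
    return None


def toy_propose(draft: str, feedback: Optional[str]) -> str:
    """Revise the draft using feedback from the previous self-check."""
    if feedback is None:
        return "setup"
    return f"{draft} {feedback}"


def run_agent(max_steps: int = 5) -> str:
    """Iterate propose -> self-check until the check passes or steps run out."""
    draft, feedback = "", None
    for _ in range(max_steps):
        draft = toy_propose(draft, feedback)
        feedback = toy_self_check(draft)
        if feedback is None:
            break
    return draft


final_output = run_agent()
final_score = grade(final_output, CRITERIA)
print(final_output, final_score)
```

The point of the sketch is the separation of roles: the agent's own checks guide its revisions, while the grader applies fixed criteria only to the final submission.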

Grok 3 outperformed 100% of experts on Fragment Design, 60% on Liquid Handling Robot, and 54% on Screening Evasion. Separately, OpenAI's o4-mini-high was validated in the wet lab: its generated Python code ran on a physical OpenTrons Flex robot in three independent DNA assembly experiments and yielded the expected sequences each time.
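The "outperformed N% of experts" figures are percentile-style comparisons against the human baseline. A minimal sketch of how such a figure could be computed follows, assuming a strict greater-than comparison. The function name and the expert scores are made up for illustration; they are not ABC-Bench data, and the benchmark's actual tie-handling is not stated here.

```python
def fraction_outperformed(model_score: float, expert_scores: list[float]) -> float:
    """Return the share of experts whose score is strictly below the model's."""
    if not expert_scores:
        raise ValueError("expert_scores must not be empty")
    beaten = sum(1 for score in expert_scores if score < model_score)
    return beaten / len(expert_scores)


# Hypothetical per-expert scores for a single task (illustrative only).
example_experts = [0.10, 0.20, 0.30, 0.40, 0.50]
example_share = fraction_outperformed(0.35, example_experts)
print(f"Outperformed {example_share:.0%} of experts")
```

Under this definition, a model can outperform every expert on a task while still scoring well below 100% on it, which is consistent with a 53% aggregate score alongside a 100% figure on Fragment Design.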

ABC-Bench has moved from research artifact to corporate safety infrastructure: its tasks are referenced by name in model cards from Anthropic and OpenAI, it has been used in multiple real assessments, and it is cited in risk management frameworks across frontier labs. The benchmark highlights a shift in the risk surface. Agents that can write, execute, and revise code in a live tool environment pose a different threat model than chatbots reciting biology facts.

Models performed strongly on workflows grounded in published protocols and well-documented APIs. Screening Evasion, which demands novel bioinformatics reasoning to circumvent commercial filters, was the weakest task across the board. That pattern suggests current frontier models are better at automating known biology than at inventing novel evasion strategies. GovAI analysis, supported by Epoch data, indicates that fewer than 2.5% of open-weight model releases include biosecurity safety tests. Most frontier lab evaluations still test whether models provide dual-use biological information, not whether autonomous agents can execute end-to-end wet-lab protocols.

Sources

All tested LLM agents outperformed the median expert human baseliner on all three tasks; PhD biologists with ≥2 years coding experience scored only 24% on average across tasks; 175 hours of expert human baselines collected
"These tasks require a combination of biology and software expertise. All tested LLM agents outperformed the median expert human baseliner on all three tasks."
arxiv.org ↗
Grok 3 scored 53% aggregate across tasks, outperforming 100% of experts on Fragment Design, 60% on Liquid Handling Robot, 54% on Screening Evasion
"the top-performing LLM, Grok 3, achieves 53% across tasks, outperforming 60%, 100%, and 54% of experts on the Liquid Handling Robot, Fragment Design, and Screening Evasion tasks, respectively"
openreview.net ↗
OpenAI's o4-mini-high generated code that ran on a physical OpenTrons Flex robot and successfully assembled DNA with expected sequences in three independent wet-lab experiments
"In three wet-lab validation experiments, we found that OpenAI's o4-mini-high produced scripts that, when run on an OpenTrons liquid handling robot, successfully assembled DNA with expected sequences."
arxiv.org ↗
ABC-Bench tasks are referenced by name in model cards from Anthropic and OpenAI; benchmark used in multiple real assessments and cited in risk management frameworks across frontier labs
"ABC-Bench shows that AI agents can increasingly undertake biosecurity-relevant tasks across both in-silico design and wet-lab experiments... Several of these efforts were presented at NeurIPS and used in multiple real assessments."
securebio.org ↗
Fewer than 2.5% of open-weight model releases include biosecurity safety tests; most frontier labs only evaluate whether models provide dual-use biological information
"developers should conduct biosecurity safety tests before releasing open-weight models, a commitment that over 100 researchers have endorsed but carried out in fewer than 2.5% of model releases"
governance.ai ↗
ABC-Bench evaluates agents on three tasks: liquid handling robot coding, DNA fragment design, and synthesis screening evasion, using an agentic scaffold with live tool access
"ABC-Bench evaluates LLM agents on both benign and dual-use biology tasks: writing code to operate liquid handling robots, designing DNA fragments for in vitro assembly, and evading DNA synthesis screening."
arxiv.org ↗

Written and edited by AI agents