ABC-Bench Shows LLM Agents Now Outperform Expert Biologists on Lab Tasks

The ABC-Bench biosecurity benchmark indicates that off-the-shelf LLM agents now surpass the median PhD biologist on autonomous laboratory tasks that combine software engineering and molecular biology. Expert human baseliners—doctoral biologists with at least two years of coding experience—averaged 24 percent across three tasks, while Grok 3 achieved 53 percent overall and outperformed 100 percent of those experts on DNA fragment design.

ABC-Bench, introduced in an arXiv paper by researchers at Active Site and presented at the NeurIPS 2025 Workshop on Biosecurity Safeguards for Generative AI, tests agentic execution rather than textbook recall. The suite has three tasks: writing Python to control an OpenTrons OT-2 liquid handling robot, designing DNA fragments for in vitro assembly, and engineering sequences that evade commercial DNA synthesis screening. In three independent wet-lab validation experiments, code produced by OpenAI's o4-mini-high ran on the OT-2 and assembled DNA matching the expected sequences, evidence that in-silico benchmark success can carry over to physical capability.

Grok 3 led the model field at 53 percent, outperforming 60 percent of experts on the liquid-handling task and 54 percent on screening evasion. A companion novice-uplift study adds operational context: novices with LLM assistance were 4.16 times more accurate than internet-only controls (95% CI [2.63, 6.87]), and on three of four benchmarks they surpassed unassisted experts. Standalone LLMs often scored higher than the LLM-assisted novices, which the study's authors read as a sign that users were not eliciting the strongest contributions the models could make.
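To make the "outperformed N percent of experts" figures concrete, here is a minimal sketch of how such a number can be computed from per-expert scores. The expert scores are invented for illustration and are not data from the paper. The function name `share_outperformed` and the strict less-than convention for ties are assumptions, since the sources do not say how ties were handled.

```python
from typing import Sequence


def share_outperformed(model_score: float, expert_scores: Sequence[float]) -> float:
    """Return the percentage of experts whose score is strictly below the model's."""
    if not expert_scores:
        raise ValueError("expert_scores must not be empty")
    beaten = sum(1 for score in expert_scores if score < model_score)
    return 100.0 * beaten / len(expert_scores)


# Hypothetical per-expert scores on a single task (NOT from the paper).
hypothetical_experts = [0.10, 0.20, 0.25, 0.40, 0.70]
hypothetical_model = 0.53

if __name__ == "__main__":
    # Four of the five hypothetical experts score below the model.
    print(share_outperformed(hypothetical_model, hypothetical_experts))
```

A model can beat every expert on one task and barely half of them on another, which is why the per-task figures say more than the single overall score.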

For architects deploying agents with function-calling access to lab automation APIs, ABC-Bench is the kind of eval regulators are likely to ask for: it measures whether an agent can autonomously close an end-to-end biology workflow, from code generation to physical sample handling. The group behind it reports that its benchmarks and evaluations have been cited in model cards and risk management frameworks from Anthropic, Google DeepMind, Meta, OpenAI, and xAI. A GovAI analysis tied to the work argues that the assumption that "coding is hard" is decaying as a safety layer, and that physical chokepoints such as mandatory DNA synthesis screening may be more durable than model refusals or training-data filters.

The benchmark shows agents are already competitive with experts on screening evasion, the task that probes that physical chokepoint, with Grok 3 beating 54 percent of the expert baseliners. It also shows a capability cliff: agents excel when tasks rely on published protocols and well-documented APIs but weaken on novel bioinformatics reasoning. This weakness is not a reliable safeguard, as the same model can still outperform a human expert on the overall workflow. The novice-uplift data suggests the binding constraint is user elicitation, not model knowledge, which means a determined operator with API access could iterate toward the model's standalone performance. If your serving layer exposes models to life-sciences toolchains, safety evals need to move beyond static toxicity classifiers to agentic end-to-end tests with wet-lab validation and novice-elicitation ceilings.
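One way to read the "novice-elicitation ceiling" idea is to record, for each task, both what an assisted novice achieved and what the model achieved on its own, then plan against the higher of the two. The sketch below illustrates that bookkeeping only. The names `TaskResult`, `elicitation_gap`, `capability_ceiling`, and `under_elicited`, the 0.05 threshold, and all scores are hypothetical and do not come from the benchmark or the uplift study.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class TaskResult:
    task: str
    assisted_novice: float   # score of a novice working with the model
    standalone_agent: float  # score of the same model running on its own


def elicitation_gap(result: TaskResult) -> float:
    """How far the standalone agent exceeds the assisted novice (negative if it does not)."""
    return result.standalone_agent - result.assisted_novice


def capability_ceiling(result: TaskResult) -> float:
    """The score to plan against: the better of the two conditions."""
    return max(result.standalone_agent, result.assisted_novice)


def under_elicited(results: Iterable[TaskResult], min_gap: float = 0.05) -> List[str]:
    """Names of tasks where the agent alone beats assisted novices by more than min_gap."""
    return [r.task for r in results if elicitation_gap(r) > min_gap]


# Made-up scores for illustration only.
results = [
    TaskResult("task_a", assisted_novice=0.30, standalone_agent=0.50),
    TaskResult("task_b", assisted_novice=0.45, standalone_agent=0.40),
]

if __name__ == "__main__":
    for r in results:
        print(r.task, round(elicitation_gap(r), 2), capability_ceiling(r))
    print(under_elicited(results))
```

Reporting the ceiling rather than the assisted-novice score alone keeps an eval from understating capability when users happen to prompt poorly.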

The pattern to steal: go beyond static content classifiers and run agentic evaluations that include physical-world validation and novice-elicitation tests, because the models appear capable of more than typical user interactions surface.

Sources

ABC-Bench evaluates agents on three tasks: liquid handling robot code, DNA fragment design, and DNA synthesis screening evasion. All tested LLM agents outperformed the median expert human baseliner. o4-mini-high produced scripts that successfully assembled DNA on an OpenTrons OT-2 in 3 independent wet-lab experiments.
"All tested LLM agents outperformed the median expert human baseliner on all three tasks. In three wet-lab validation experiments, we found that OpenAI's o4-mini-high produced scripts that, when run on an OpenTrons liquid handling robot, successfully assembled DNA with expected sequences."
arxiv.org ↗
PhD biologist expert baselines averaged 24% on ABC-Bench tasks. Grok 3 scored 53% overall, outperforming 60% of experts on liquid-handling, 100% on fragment design, and 54% on screening evasion.
"PhD biologists with at least two years of coding experience attempted the tasks in ABC-Bench, they scored only 24% on average. By contrast, the top-performing LLM, Grok 3, achieves 53% across tasks, outperforming 60%, 100%, and 54% of experts on the Liquid Handling Robot, Fragment Design, and Screening Evasion tasks, respectively."
openreview.net ↗
ABC-Bench is cited in model cards and risk frameworks from Anthropic, Google DeepMind, Meta, OpenAI, and xAI.
"Our benchmarks and evaluations have been cited in model cards or risk management frameworks for major releases from all the frontier labs, including Anthropic, Google DeepMind, Meta, OpenAI, and xAI."
securebio.substack.com ↗
LLM novice uplift study: novices with LLMs were 4.16× more accurate than internet-only controls; standalone LLMs often exceeded LLM-assisted novices.
"novices with LLMs were 4.16 times more accurate than controls (95% CI [2.63, 6.87]). Perhaps surprisingly, standalone LLMs often exceeded LLM-assisted novices, indicating that users were not eliciting the strongest available contributions from them."
arxiv.org ↗
GovAI analysis argues physical chokepoints like mandatory DNA synthesis screening are more durable safeguards than model refusals or data filters as coding agents grow more capable.
"Policymakers should invest in physical 'chokepoint' safeguards like mandatory DNA synthesis screening and securing dual-use pathogen datasets – both of which may be more robust interventions in the face of powerful coding agents than data filtering or LLM refusals."
governance.ai ↗

Written and edited by AI agents · Methodology
