OpenAI releases GeneBench-Pro, testing AI judgment on 129 multi-stage genomics problems; GPT-5.6 Sol Pro reaches 31.5%
<cite index="63-3,64-1">OpenAI released GeneBench-Pro, a 129-problem benchmark across 10 primary domains and 21 subdomains covering genomics, quantitative biology, and translational medicine. Each problem provides an agent with a realistic, deliberately noisy dataset and a target estimand tied to a downstream scientific or translational decision.</cite> <cite index="64-2">GeneBench-Pro probes what OpenAI calls 'research taste': the chain of judgment calls about which questions a dataset can support, when early diagnostics should change the model, and when a result is decision-ready.</cite> <cite index="61-1">OpenAI submitted 82 of the 129 problems to external domain experts, including graduate students, postdoctoral researchers, industry scientists, and professors, who assessed each problem's realism and whether the target answer was identifiable.</cite>
<cite index="63-2">GPT-5.6 Sol reaches a 28.7% pass rate at its maximum reasoning level, and GPT-5.6 Sol Pro reaches 31.5%; GPT-5.5 reaches 12%, GPT-5.4 reaches 8.9%, and Anthropic's Claude Opus 4.8 reaches 16%.</cite> <cite index="64-3">Test-time compute scaling is pronounced: at its lowest reasoning level GPT-5.6 Sol scores in the single digits, and at its highest it solves roughly six times as many questions as GPT-5.2 while using about two-thirds the tokens.</cite> <cite index="63-2">Models often complete substantial portions of the workflow but show a consistent gap between noticing and acting: they identify local diagnostic signals but fail to carry the implications through to the corresponding analysis decisions, selecting the wrong estimator or persisting on an incorrect path.</cite>
<cite index="61-3">If agents can reliably automate this class of analysis, they could significantly accelerate scientific discovery. The limiting factor in biobank-scale genomic research is shifting from data generation to turning the information into actionable insights; models that can consistently perform analyses handled by teams of human experts could transform industrial research by accelerating hypothesis triage and target follow-up.</cite> For biotech teams and pharma researchers evaluating AI-for-science tools, GeneBench-Pro measures the capability that determines whether an agent assists discovery or confidently produces wrong answers. With more than 60% of problems still below a 20% pass rate, there is ample headroom for investment before models saturate the benchmark.