Frontier LLMs ace MRI physics multiple-choice questions at rates above 93% while failing to recall vendor-specific operational procedures in free-text conditions — a gap that matters acutely when institutions consider deploying those models to guide clinical or research scanner workflows.

Perry E. Radau published MRI-Eval on arXiv on May 6, 2026. The benchmark targets a blind spot: nearly all prior MRI-focused evaluations use review-book multiple-choice questions where top proprietary models already saturate the leaderboard. MRI-Eval sources 1,365 scored items from textbooks, GE scanner manuals, programming course materials, and expert-generated questions across nine categories and three difficulty tiers. The distinguishing feature is vendor-specific operational content: GE scanner procedures, pulse sequence configuration, hardware-level protocols.

Five models were tested: GPT-5.4, Claude Opus 4.6, Claude Sonnet 4.6, Gemini 2.5 Pro, and Llama 3.3 70B. The benchmark runs three conditions. Primary MCQ presents standard four-option questions. Stem-only removes answer choices and asks an independent LLM judge to evaluate free-text recall. Primed stem-only tests whether models correctly push back when a user asserts a wrong answer before asking for the correct one.
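The paper's prompt templates are not given in the abstract, so the following is only a minimal sketch of how the three conditions differ structurally; the field names and prompt wording are assumptions, not MRI-Eval's actual prompts.

```python
# Illustrative sketch only: field names and prompt wording are assumptions,
# not the prompts used in the MRI-Eval paper.
from dataclasses import dataclass

@dataclass
class Item:
    stem: str           # question text without answer choices
    choices: list[str]  # four options, shown only in the MCQ condition
    answer: str         # reference answer for the independent LLM judge

def mcq_prompt(item: Item) -> str:
    """Primary MCQ: standard four-option question."""
    opts = "\n".join(f"{letter}. {c}" for letter, c in zip("ABCD", item.choices))
    return f"{item.stem}\n{opts}\nAnswer with the letter of the correct option."

def stem_only_prompt(item: Item) -> str:
    """Stem-only: free-text recall, graded afterward by an independent LLM judge."""
    return f"{item.stem}\nAnswer in one or two sentences."

def primed_stem_only_prompt(item: Item, wrong_claim: str) -> str:
    """Primed stem-only: a wrong user assertion precedes the question,
    probing whether the model pushes back instead of agreeing."""
    return (f"I believe the answer is {wrong_claim}. {item.stem}\n"
            "What is the correct answer?")
```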

In standard MCQ, overall accuracy ranged from 93.2% to 97.1%. GE scanner operations was the weakest category, ranging from 88.2% to 94.6%. Remove the answer choices and the floor drops dramatically: frontier model accuracy in stem-only conditions fell to 58.4%–61.1%. Llama 3.3 70B fell to 37.1%. GE scanner operations stem-only accuracy collapsed to 13.8%–29.8% across all models. The best-performing model answered roughly three in ten questions correctly on GE operational knowledge without the multiple-choice scaffold.

FIG. 02 MCQ accuracy: frontier models 93–97% overall, but GE scanner operations consistently lagged by 5–9 percentage points. — MRI-Eval arXiv preprint

For enterprise AI architects, this is an architecture risk, not just a benchmark curiosity. The MCQ format mirrors how most vendor LLM evaluations are presented to buyers: curated question sets with options, a leaderboard score, a sign-off. MRI-Eval's stem-only condition simulates what happens at deployment — a technologist or researcher asks an open-ended question and expects a confident, accurate answer. The 30–40 percentage-point gap between conditions represents the failure mode that goes undetected until a protocol error surfaces in production.
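One way to surface that gap at procurement time is to compare the two conditions per category rather than reading a single leaderboard number. The sketch below uses the upper ends of the ranges reported in the article (not necessarily from the same model), and the 20-point gap threshold is an arbitrary example policy, not a value from the paper.

```python
# Sketch of a procurement-time check: compare MCQ and open-recall (stem-only)
# accuracy per category and flag large gaps. Values are the upper ends of the
# ranges reported in the article; the threshold is an assumed policy value.
mcq = {"overall": 0.971, "ge_scanner_ops": 0.946}
stem_only = {"overall": 0.611, "ge_scanner_ops": 0.298}

GAP_THRESHOLD = 0.20  # example value, not from the paper

for category in mcq:
    gap = mcq[category] - stem_only[category]
    status = "FLAG: MCQ score masks weak open recall" if gap > GAP_THRESHOLD else "ok"
    print(f"{category}: MCQ {mcq[category]:.1%}, stem-only {stem_only[category]:.1%}, "
          f"gap {gap:.1%} -> {status}")
```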

The compliance dimension is direct. FDA guidance on AI/ML-based software as a medical device (SaMD) requires that performance claims tie to intended use context. A 97% score on an MRI physics MCQ benchmark does not establish competence for protocol guidance on GE-specific scanner operations. This benchmark provides empirical scaffolding to force that distinction during procurement. Research institutions operating outside SaMD jurisdiction still carry patient safety and research integrity obligations that the same gap implicates.

FIG. 03 Dramatic performance collapse in open-recall (stem-only): frontier models dropped ~35 points; Llama 3.3 70B fell 56 points. — MRI-Eval arXiv preprint

Radau frames MRI-Eval as a relative comparison tool, not an absolute competency certifier — a meaningful caveat. The benchmark does not cover Siemens, Philips, or Canon scanner ecosystems, limiting applicability to multi-vendor environments. Expert-generated questions introduce subjectivity requiring audit. The primed condition is useful for probing sycophancy when a user asserts a wrong answer, but its results are not reported in detail in the abstract.

MRI-Eval establishes a replicable methodology: tier questions by difficulty, source them from vendor documentation, and test free-text recall alongside MCQ. Models that performed best on MCQ did not preserve that ranking in stem-only conditions — meaning the benchmark has discriminative power. Any institution evaluating LLMs for instrumentation-adjacent knowledge work, whether in radiology, laboratory automation, or industrial equipment, now has a design template to build from. A 97% MCQ score that collapses below 30% in open recall is not a passing grade.
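A minimal sketch of that design template follows. The field names, difficulty tiers, and scoring interface are assumptions for illustration, not MRI-Eval's actual schema; only the overall shape (tiered items sourced from vendor documentation, scored under both MCQ and free-text conditions) comes from the article.

```python
# Minimal sketch of the design template described above. Field and enum names
# are assumptions, not MRI-Eval's schema.
from dataclasses import dataclass
from enum import Enum

class Difficulty(Enum):
    BASIC = 1
    INTERMEDIATE = 2
    ADVANCED = 3

@dataclass
class BenchmarkItem:
    category: str           # e.g. "GE scanner operations", "pulse sequences"
    difficulty: Difficulty  # one of three tiers
    source: str             # textbook, vendor manual, or expert-written
    stem: str               # open-ended question text
    choices: list[str]      # options shown only in the MCQ condition
    correct_index: int      # index into choices; its text doubles as the
                            # reference answer for free-text judging

def score_item(item: BenchmarkItem, mcq_choice_index: int,
               free_text_response: str, judge) -> dict:
    """Score one item under both conditions. `judge` is any callable that
    compares a free-text answer against the reference and returns a bool."""
    reference = item.choices[item.correct_index]
    return {
        "mcq_correct": mcq_choice_index == item.correct_index,
        "stem_only_correct": judge(free_text_response, reference),
    }
```

Reporting the two scores side by side per category, rather than a single aggregate, is what gives the template its discriminative power.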
