A neuroimaging study across three language and vision-language model pairs found that multimodal pretraining does not universally enhance human alignment during text-only reading, suggesting that the added cost may not yield representational gains on abstract language tasks. Researchers from MIT's McGovern Institute, City University of Hong Kong, and Chongqing University assessed the models against whole-cortex fMRI recordings and synchronized eye-tracking saccades from human natural-reading datasets, feeding the models text alone at inference time to isolate the effect of visual training history.
The study's main methodological contribution is its experimental design. The authors used closely matched pairs from the same architectural lineage, varying only the presence of multimodal pretraining, rather than comparing disparate model families where differences in architecture, parameter count, training corpus, or post-training regime could confound attribution. By withholding visual input during inference, the team attributed effects specifically to the model's learning history rather than to online cross-modal fusion. Brain alignment was scored against voxel-level fMRI responses across the entire cortex, and behavioral alignment was measured via eye-movement patterns, providing a dual-signal benchmark more granular than typical downstream-task accuracy.
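For readers unfamiliar with voxel-level alignment scoring, the sketch below shows one common way such a score is computed: a cross-validated ridge encoding model that maps model activations to voxel responses and reports the held-out Pearson correlation per voxel. This is an illustration under stated assumptions, not the authors' pipeline. The article does not specify their regression method, and the function names, the `alpha` value, the fold count, and the synthetic data are all hypothetical.

```python
import numpy as np

def ridge_fit(X, Y, alpha):
    """Closed-form ridge regression: W = (X^T X + alpha I)^-1 X^T Y."""
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)

def voxelwise_pearson(Y_true, Y_pred):
    """Pearson r between measured and predicted responses, one value per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / np.where(denom == 0, 1.0, denom)

def alignment_score(features, voxels, alpha=1.0, n_folds=5):
    """Mean held-out voxelwise correlation over contiguous folds.

    features: (n_samples, n_features) model activations (assumed precomputed)
    voxels:   (n_samples, n_voxels) fMRI responses aligned to the same samples
    """
    n = features.shape[0]
    scores = []
    for test_idx in np.array_split(np.arange(n), n_folds):
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        # Standardise with training statistics only, to avoid leakage.
        mu = features[train].mean(axis=0)
        sd = features[train].std(axis=0)
        sd[sd == 0] = 1.0
        X_train = (features[train] - mu) / sd
        X_test = (features[test_idx] - mu) / sd
        y_mu = voxels[train].mean(axis=0)
        W = ridge_fit(X_train, voxels[train] - y_mu, alpha)
        pred = X_test @ W + y_mu
        scores.append(voxelwise_pearson(voxels[test_idx], pred))
    return np.mean(scores, axis=0)  # shape: (n_voxels,)

# Synthetic demo: 200 samples, 16 features, 50 voxels (no real data involved).
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 16))
true_w = rng.normal(size=(16, 50))
voxels = feats @ true_w + rng.normal(scale=0.5, size=(200, 50))
scores = alignment_score(feats, voxels)

# Null control: voxel responses unrelated to the features.
null_scores = alignment_score(feats, rng.normal(size=(200, 50)))
```

In a matched-pair design like the one described, the same scoring function would be applied to the LLM's and the VLM's activations for identical text, so any difference in scores can be read against the single factor that differs between the two models.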
The arXiv paper reports no consistent VLM advantage. The authors find that language-internal representations remain the key factor for modeling human text processing; the VLM advantage emerges more selectively when sentences contain stronger visual semantic content—concrete, imagistic language that might engage visual association areas, with converging evidence from both fMRI and eye-movement alignments. This aligns with prior literature showing that scaling LLMs from 774 million to 65 billion parameters improves fMRI and eye-tracking fit, and that multimodal models excel only when visual grounding is relevant. The authors propose their work as a controlled in silico framework for disentangling these factors.
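The selective pattern described above can be probed with a paired comparison split by visual content. The sketch below is a hypothetical analysis, not the one reported in the paper: it assumes per-sentence alignment scores for both models and a per-sentence visual or concreteness rating are already available, uses a median split, and estimates a bootstrap confidence interval on the VLM-minus-LLM gap in each half. All names and the synthetic numbers are invented for illustration.

```python
import numpy as np

def paired_gap_by_bin(llm_scores, vlm_scores, visual_rating, n_boot=2000, seed=0):
    """Mean VLM-minus-LLM alignment gap, with a bootstrap 95% CI,
    for sentences below and above the median visual rating."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(vlm_scores) - np.asarray(llm_scores)
    rating = np.asarray(visual_rating)
    high = rating > np.median(rating)
    out = {}
    for name, mask in (("low_visual", ~high), ("high_visual", high)):
        d = diff[mask]  # assumes both halves are non-empty
        # Resample sentences with replacement and recompute the mean gap.
        idx = rng.integers(0, d.size, size=(n_boot, d.size))
        boot_means = d[idx].mean(axis=1)
        lo, hi = np.percentile(boot_means, [2.5, 97.5])
        out[name] = {
            "mean_gap": float(d.mean()),
            "ci95": (float(lo), float(hi)),
            "n": int(d.size),
        }
    return out

# Synthetic demo: the simulated VLM gap grows only for ratings above 3.
rng = np.random.default_rng(1)
rating = rng.uniform(1.0, 5.0, size=400)
llm = 0.2 + rng.normal(scale=0.05, size=400)
vlm = llm + 0.02 * np.clip(rating - 3.0, 0.0, None) + rng.normal(scale=0.02, size=400)
result = paired_gap_by_bin(llm, vlm, rating)
```

A median split is the simplest choice here. Regressing the gap on the continuous rating would use the data more efficiently, at the cost of assuming a functional form.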
This is a laboratory measurement paper with no production evidence, and the authors provide no serving metrics—no throughput numbers, per-token pricing, or p99 latency. Instead, they offer a model-selection signal. In a text-only pipeline—RAG over documents, summarization, classification, or instruction-following without images—the VLM variant of a given family yields no more human-like internal representations than its LLM counterpart. The extra compute and memory footprint of multimodal weights buys no alignment benefit unless the input is rich in concrete visual semantics.
For platform teams, the challenge is generalizability. The study holds architecture, scale, and data mixture constant, conditions that are rarely met in commercial selection, where a "comparable" VLM and LLM may differ in post-training recipes, context length, or instruction-tuning datasets. The authors acknowledge that these factors have historically complicated brain-alignment estimates and remain confounds when engineers shop across API catalogs. Another gap is the leap from neural alignment to practical utility: whole-cortex fMRI correlation is an intriguing intermediate metric, but it is not a substitute for end-to-end task accuracy or human preference rankings.
The evidence is not uniform. Bavaresco et al. report that VLMs outperform language-only models on fMRI alignment, but their stimuli are isolated concept words from the Pereira dataset (180 discrete concepts presented with pictures or sentences), not the continuous natural reading that Wu et al. study. Bavaresco et al.'s suite also relies on older encoder-style architectures such as LXMERT and IDEFICS2 rather than tightly matched modern generative pairs, and the authors note that only some of those VLMs learn genuinely more human-like concepts while others are merely sensitive to inference-time context. That discrepancy matters for platform teams: in natural-reading-style text pipelines (RAG, summarization, classification, long-form instruction following), the multimodal premium buys no proven alignment benefit, but workloads centered on isolated grounded concepts or actual multimodal input may still favor a VLM.
Written and edited by AI agents