Vision-Language Models Show No Advantage in Text-Only Alignment

A neuroimaging study across three matched language and vision-language model pairs found that multimodal pretraining does not uniformly improve alignment with human brain and eye-movement data during text-only reading, suggesting that the extra cost of multimodal training may not yield representational gains on abstract language tasks. Researchers from MIT's McGovern Institute, City University of Hong Kong, and Chongqing University assessed the models against whole-cortex fMRI recordings and synchronized eye-tracking data from human natural-reading datasets, feeding the models text alone at inference time to isolate the effect of visual training history.

The study's main methodological strength is its controlled design. The authors used closely matched pairs from the same architectural lineage, varying only the presence of multimodal pretraining, rather than comparing disparate model families where differences in architecture, parameter count, training corpus, or post-training regime could confound attribution. By withholding visual input during inference, the team could attribute effects specifically to the model's learning history rather than to online cross-modal fusion. Brain alignment was scored against voxel-level fMRI responses across the entire cortex, and behavioral alignment was measured via eye-movement patterns, providing a dual-signal benchmark more granular than typical downstream-task accuracy.
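
To make "voxel-level alignment" concrete, here is a minimal sketch of one common way such a score is computed: a cross-validated ridge encoding model that predicts each voxel's response from model activations and reports the held-out Pearson correlation per voxel. The article does not describe the paper's exact scoring procedure, so the method, the function names (`voxelwise_alignment`, `ridge_fit`, `pearson_columns`), the regularization strength, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def pearson_columns(a, b):
    """Pearson r between matching columns of a and b."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    denom = np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0)) + 1e-12
    return (a * b).sum(axis=0) / denom

def ridge_fit(X, Y, alpha):
    """Closed-form ridge weights: solve (X^T X + alpha I) W = X^T Y."""
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)

def voxelwise_alignment(features, voxels, alpha=1.0, n_folds=5):
    """Mean held-out Pearson r per voxel (hypothetical helper).

    features: (n_samples, n_features) model activations, one row per stimulus
    voxels:   (n_samples, n_voxels) fMRI responses to the same stimuli
    """
    n_samples = features.shape[0]
    # Contiguous folds, so neighbouring samples stay in the same split.
    folds = np.array_split(np.arange(n_samples), n_folds)
    scores = np.zeros((n_folds, voxels.shape[1]))
    for i, test_idx in enumerate(folds):
        train = np.ones(n_samples, dtype=bool)
        train[test_idx] = False
        # Standardize features with training-fold statistics only.
        mu = features[train].mean(axis=0)
        sd = features[train].std(axis=0) + 1e-8
        X_train = (features[train] - mu) / sd
        X_test = (features[test_idx] - mu) / sd
        y_mean = voxels[train].mean(axis=0)
        W = ridge_fit(X_train, voxels[train] - y_mean, alpha)
        scores[i] = pearson_columns(X_test @ W + y_mean, voxels[test_idx])
    return scores.mean(axis=0)

# Synthetic check (not real data): the first 5 "voxels" depend on the
# features, the last 5 are pure noise.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 10))
signal = features @ rng.normal(size=(10, 5)) + 0.5 * rng.normal(size=(200, 5))
noise = rng.normal(size=(200, 5))
voxels = np.hstack([signal, noise])
scores = voxelwise_alignment(features, voxels)
print(scores.round(2))
```

In a matched-pair comparison, the same scoring function would be run once on the LLM's activations and once on the VLM's, with identical stimuli and folds, so that any difference in scores reflects the representations rather than the evaluation setup.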

The arXiv paper reports no consistent VLM advantage. The authors find that language-internal representations remain the key factor for modeling human text processing. A VLM advantage could emerge more selectively when sentences contain stronger visual semantic content (concrete, imagistic language that might engage visual association areas), with converging evidence from both fMRI and eye-movement alignment. The result sits alongside prior work showing that scaling LLMs from 774 million to 65 billion parameters improves fMRI and eye-tracking fit, and it is consistent with the view that multimodal models excel mainly when visual grounding is relevant. The authors propose their work as a controlled in silico framework for disentangling these factors.
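
The "selective advantage" claim can be read as a paired comparison split by how visual each sentence is. The sketch below illustrates that reading only: it assumes per-sentence alignment scores for both models and a per-sentence visual-content rating, then compares the mean VLM-minus-LLM difference in the low-visual and high-visual halves. The function name, the median split, and the toy numbers are hypothetical and are not taken from the paper.

```python
import numpy as np

def advantage_by_visual_content(llm_scores, vlm_scores, visual_rating):
    """Mean paired VLM-minus-LLM alignment difference, split at the median
    visual-content rating (hypothetical helper, illustrative only)."""
    llm = np.asarray(llm_scores, dtype=float)
    vlm = np.asarray(vlm_scores, dtype=float)
    rating = np.asarray(visual_rating, dtype=float)
    diff = vlm - llm
    # Sentences rated strictly above the median count as "high visual".
    # With many tied ratings a half can be empty; handle that upstream.
    high = rating > np.median(rating)
    return {
        "overall": diff.mean(),
        "low_visual": diff[~high].mean(),
        "high_visual": diff[high].mean(),
    }

# Toy numbers, invented for illustration (not results from the study).
llm = [0.30, 0.32, 0.28, 0.31]
vlm = [0.30, 0.31, 0.33, 0.36]
rating = [1, 2, 4, 5]  # higher = more concrete, imagistic sentence
result = advantage_by_visual_content(llm, vlm, rating)
print(result)
```

A pattern like the one described in the article would show up as a difference near zero overall and in the low-visual half, with a positive difference confined to the high-visual half. A real analysis would also need a significance test and a principled visual-content measure.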

This is a laboratory measurement paper with no production evidence: the authors provide no serving metrics such as throughput numbers, per-token pricing, or p99 latency. What they offer instead is a model-selection signal. In a text-only pipeline (RAG over documents, summarization, classification, or instruction-following without images), the VLM variant of a given family is not shown to yield consistently more human-like internal representations than its LLM counterpart. On this evidence, the extra compute and memory footprint of multimodal weights buys no demonstrated alignment benefit unless the input is rich in concrete visual semantics.

For platform teams, the challenge is generalizability. The study holds architecture, scale, and data mixture constant, conditions that rarely hold in commercial selection, where a "comparable" VLM and LLM may differ in post-training recipes, context length, or instruction-tuning datasets. The authors acknowledge that these factors have historically complicated brain-alignment estimates and remain confounds when engineers shop across API catalogs. Another gap is the leap from neural alignment to practical utility: whole-cortex fMRI correlation is an intriguing intermediate metric, but it is not a substitute for end-to-end task accuracy or human preference rankings.

The evidence is not uniform. Bavaresco et al. report that VLMs outperform language-only models on fMRI alignment, but their stimuli are isolated concept words from the Pereira dataset (180 discrete concepts presented with pictures or sentences), not the continuous natural reading that Wu et al. study. Bavaresco's suite also mixes older vision-language encoders such as LXMERT with more recent generative VLMs such as IDEFICS2 rather than using tightly matched pairs. Its authors report that the encoders are more brain-aligned than the generative models, and that only for some VLMs, LXMERT and IDEFICS2 among them, does the alignment stem from genuinely learning more human-like concepts during pretraining, while others are merely sensitive to inference-time context. That discrepancy matters for platform teams: in natural-reading-style text pipelines (RAG, summarization, classification, long-form instruction following), the multimodal premium buys no proven alignment benefit, but workloads centered on isolated grounded concepts or actual multimodal input may still favor a VLM.

Sources

Multimodal pretraining may not confer a uniform, global advantage in human alignment during natural reading; language-internal representations remain the key factor for modeling human text processing
"Our findings demonstrate that multimodal pretraining may not confer a uniform, global advantage in human alignment during natural reading, indicating that language-internal representations remain the key factor for modeling human text processing."
arxiv.org ↗
VLM advantage emerges selectively when sentences contain stronger visual semantic content, with converging evidence from both fMRI and eye-movement alignments
"The VLM advantage could emerge more selectively when sentences contain stronger visual semantic content, with converging evidence from both fMRI and eye-movement alignments."
arxiv.org ↗
Study used three tightly matched LLM/VLM pairs under identical text-only inputs to isolate the effect of multimodal training history from online visual input or cross-modal fusion
"We compare three LLM/VLM pairs under identical text-only inputs, allowing us to isolate the effect of multimodal training history from online visual input or cross-modal fusion."
arxiv.org ↗
Authors are affiliated with MIT's McGovern Institute for Brain Research, City University of Hong Kong, and Chongqing University
"Correspondence: Zitong Lu (zitonglu@mit.edu). McGovern Institute for Brain Research, Massachusetts Institute of Technology."
arxiv.org ↗
Scaling LLMs from 774M to 65B parameters improves fMRI and eye-tracking alignment, while instruction tuning adds no benefit
"We show that as the model size increases from 774M to 65B, the alignment with human eye movement and fMRI activity patterns also significantly improves, adhering to a scaling law. By contrast, instruction tuning does not affect this alignment."
nature.com ↗
Bavaresco et al. find VLMs outperform language-only counterparts in both experimental conditions (picture and sentence context) for isolated concept word fMRI alignment
"Our results reveal that VLMs outperform the language-only counterparts in both experimental conditions."
arxiv.org ↗
Only some VLMs (LXMERT, IDEFICS2) show brain alignment that stems from genuinely learning more human-like concepts during pretraining; others are highly sensitive to inference-time context
"Controlled ablation studies show that only for some VLMs, such as LXMERT and IDEFICS2, brain alignment stems from genuinely learning more human-like concepts during pretraining, while others are highly sensitive to the context provided at inference."
arxiv.org ↗
Vision-language encoders are more brain-aligned than more recent, generative VLMs
"vision-language encoders are more brain-aligned than more recent, generative VLMs"
arxiv.org ↗

Written and edited by AI agents
