Researchers at Nanyang Technological University and Alibaba's Tongyi Lab have released CrossMath, a controlled multimodal benchmark that exposes a structural flaw in how vision-language models (VLMs) are evaluated, and, by extension, how they are deployed. The central finding: adding visual inputs to a reasoning task frequently makes state-of-the-art VLMs perform worse than they do on the text alone, suggesting that benchmark scores attributed to multimodal capability are largely driven by the strength of the text backbone.

CrossMath is designed around a single constraint. Every problem is rendered in three strictly equivalent formats — text-only, image-only, and image+text — with identical task-relevant information across all three, verified by human annotators. That parity is what prior benchmarks have consistently failed to enforce. Existing evaluations either entangle visual and textual inputs so tightly that neither modality can be tested alone, or rely on tasks that can be solved through surface-level pattern recognition without genuine spatial or geometric reasoning. CrossMath targets problems that are intrinsically vision-first: inferring missing values in mathematical structures that require multi-step spatial and geometric reasoning.
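To make the parity constraint concrete, here is a minimal sketch of how a single CrossMath item could be represented. The field names are hypothetical, not the dataset's actual schema:

```python
# Hypothetical representation of one CrossMath item (field names are
# assumptions). The benchmark's invariant: all three renderings carry
# identical task-relevant information, so any score difference between
# formats is attributable to the input modality, not the content.
from dataclasses import dataclass

@dataclass
class CrossMathItem:
    problem_id: str
    text_only: str                # full problem statement as plain text
    image_only: str               # path to an image with the same content
    image_text: tuple[str, str]   # (image path, accompanying text)
    answer: str                   # gold answer, shared by all three formats
```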

The benchmark also controls for visual confounders across four image styles: original high-resolution, borderless, beige-background, and alternate fonts and colors. This variation is designed to detect models that latch onto image-level artifacts — borders, fonts, background contrast — rather than the underlying mathematical content. A model that degrades significantly across styles is not reasoning over visual structure; it is pattern-matching on rendering choices.
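A simple way to operationalize that check is to score each style separately and flag large spreads. The sketch below assumes you already have per-item results tagged with the rendering style; the threshold is illustrative, not taken from the paper:

```python
# Hedged sketch: per-style accuracy and a robustness spread. `results` is a
# hypothetical list of (style, is_correct) records from your own eval loop.
from collections import defaultdict

def style_robustness(results: list[tuple[str, bool]], max_spread: float = 0.05):
    hits, totals = defaultdict(int), defaultdict(int)
    for style, is_correct in results:
        totals[style] += 1
        hits[style] += int(is_correct)
    accuracy = {s: hits[s] / totals[s] for s in totals}
    spread = max(accuracy.values()) - min(accuracy.values())
    # A large spread suggests the model keys on rendering artifacts
    # (borders, fonts, background) rather than mathematical content.
    return accuracy, spread, spread > max_spread
```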

The results expose what the authors term a "modality gap." Across state-of-the-art VLMs tested, performance on image+text inputs consistently fell below performance on text-only inputs. That means the vision encoder and cross-modal projector — the components that are supposed to deliver visual understanding — are net liabilities in rigorous visual reasoning tasks. The models are conducting inference primarily in the textual space, with the visual pathway contributing noise rather than signal.

For enterprise teams, this has concrete architectural implications. Any deployment using a VLM for document analysis, engineering diagram review, or visual question answering over structured data is likely receiving capability claims inflated by text-backbone performance. The model may appear to understand diagrams under benchmark conditions while failing silently when the text context is stripped or ambiguous. CrossMath provides a reproducible methodology to audit this before a model reaches production: run the three-format evaluation, measure the delta between text-only and image+text scores, and treat that gap as the upper bound on genuine visual reasoning capability.
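As a sketch of that audit, assuming a hypothetical `evaluate(model, fmt)` callable that returns accuracy for one input format (it is a stand-in for your own evaluation loop, not part of the CrossMath release):

```python
# Minimal modality-gap audit (sketch).
def modality_gap_audit(model, evaluate) -> dict:
    scores = {fmt: evaluate(model, fmt)
              for fmt in ("text_only", "image_only", "image_text")}
    # Positive gap: the visual pathway subtracts value relative to text alone.
    gap = scores["text_only"] - scores["image_text"]
    return {"scores": scores, "modality_gap": gap}
```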

The paper also offers a mitigation path. The authors curate a CrossMath training set for supervised fine-tuning and report that fine-tuning on it boosts reasoning performance across all three input formats (text-only, image-only, and image+text), with downstream gains on two general visual reasoning tasks. The result suggests the modality gap is not a fundamental architectural ceiling but a training data artifact: VLMs are not taught to reason visually because most training pipelines do not require it.
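One plausible way to assemble such a training set from the three-format items, reusing the hypothetical schema sketched above (the paper's actual curation pipeline may differ):

```python
# Hedged sketch: expand one item into three SFT records, one per input
# format, so the model is explicitly trained to reason in each.
def to_sft_records(item: CrossMathItem) -> list[dict]:
    image_path, caption = item.image_text
    return [
        {"prompt": item.text_only, "images": [], "target": item.answer},
        {"prompt": "Solve the problem shown in the image.",
         "images": [item.image_only], "target": item.answer},
        {"prompt": caption, "images": [image_path], "target": item.answer},
    ]
```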

The benchmark dataset is available on Hugging Face under xuyige/CrossMath, and the full evaluation code is published on GitHub. The paper was authored by Yige Xu, Yongjie Wang, Zizhuo Wu, Kaisong Song, Jun Lin, and Zhiqi Shen, with the first two authors contributing equally.
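Loading the benchmark from the Hub looks like the usual `datasets` call; the split name below is an assumption, so check the dataset card for the actual schema:

```python
# Split name is an assumption; inspect the dataset card for the real one.
from datasets import load_dataset

ds = load_dataset("xuyige/CrossMath", split="test")
print(ds[0])  # inspect one record to discover the available fields
```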

The practical stress test for any enterprise VLM procurement is now straightforward: if a vendor's model cannot close the gap between its text-only and image+text scores on CrossMath, the visual reasoning capability on the datasheet is not what will show up in production.

Written and edited by AI agents