Researchers at Nanyang Technological University and Alibaba's Tongyi Lab have released CrossMath, a controlled multimodal benchmark that exposes a structural flaw in how vision-language models (VLMs) are evaluated, and, by extension, how they are deployed. The central finding: adding visual inputs to a reasoning task frequently makes state-of-the-art VLMs perform worse than they do on the text alone, suggesting that benchmark scores attributed to multimodal capability are largely driven by the strength of the text backbone.

CrossMath is designed around a single constraint. Every problem is rendered in three strictly equivalent formats — text-only, image-only, and image+text — with identical task-relevant information across all three, verified by human annotators. That parity is what prior benchmarks have consistently failed to enforce. Existing evaluations either entangle visual and textual inputs so tightly that neither modality can be tested alone, or rely on tasks that can be solved through surface-level pattern recognition without genuine spatial or geometric reasoning. CrossMath targets problems that are intrinsically vision-first: inferring missing values in mathematical structures that require multi-step spatial and geometric reasoning.
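To make the parity constraint concrete, here is a minimal sketch of how a single CrossMath item could be represented. The field names are hypothetical, not the dataset's actual schema:

```python
# Hypothetical representation of one CrossMath item (field names are
# assumptions). The benchmark's invariant: all three renderings carry
# identical task-relevant information, so any score difference between
# formats is attributable to the input modality, not the content.
from dataclasses import dataclass

@dataclass
class CrossMathItem:
    problem_id: str
    text_only: str                # full problem statement as plain text
    image_only: str               # path to an image with the same content
    image_text: tuple[str, str]   # (image path, accompanying text)
    answer: str                   # gold answer, shared by all three formats
```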

The benchmark also controls for visual confounders across four image styles: original high-resolution, borderless, beige-background, and alternate fonts and colors. This variation is designed to detect models that latch onto image-level artifacts — borders, fonts, background contrast — rather than the underlying mathematical content. A model that degrades significantly across styles is not reasoning over visual structure; it is pattern-matching on rendering choices.
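A simple way to operationalize that check is to score each style separately and flag large spreads. The sketch below assumes you already have per-item results tagged with the rendering style; the threshold is illustrative, not taken from the paper:

```python
# Hedged sketch: per-style accuracy and a robustness spread. `results` is a
# hypothetical list of (style, is_correct) records from your own eval loop.
from collections import defaultdict

def style_robustness(results: list[tuple[str, bool]], max_spread: float = 0.05):
    hits, totals = defaultdict(int), defaultdict(int)
    for style, is_correct in results:
        totals[style] += 1
        hits[style] += int(is_correct)
    accuracy = {s: hits[s] / totals[s] for s in totals}
    spread = max(accuracy.values()) - min(accuracy.values())
    # A large spread suggests the model keys on rendering artifacts
    # (borders, fonts, background) rather than mathematical content.
    return accuracy, spread, spread > max_spread
```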

The results expose what the authors term a "modality gap." Across state-of-the-art VLMs tested, performance on image+text inputs consistently fell below performance on text-only inputs. That means the vision encoder and cross-modal projector — the components that are supposed to deliver visual understanding — are net liabilities in rigorous visual reasoning tasks. The models are conducting inference primarily in the textual space, with the visual pathway contributing noise rather than signal.

For enterprise teams, this has concrete architectural implications. Any deployment using a VLM for document analysis, engineering diagram review, or visual question answering over structured data is likely receiving capability claims inflated by text-backbone performance. The model may appear to understand diagrams under benchmark conditions while failing silently when the text context is stripped or ambiguous. CrossMath provides a reproducible methodology to audit this before a model reaches production: run the three-format evaluation, measure the delta between text-only and image+text scores, and treat that gap as the upper bound on genuine visual reasoning capability.
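As a sketch of that audit, assuming a hypothetical `evaluate(model, fmt)` callable that returns accuracy for one input format (it is a stand-in for your own evaluation loop, not part of the CrossMath release):

```python
# Minimal modality-gap audit (sketch).
def modality_gap_audit(model, evaluate) -> dict:
    scores = {fmt: evaluate(model, fmt)
              for fmt in ("text_only", "image_only", "image_text")}
    # Positive gap: the visual pathway subtracts value relative to text alone.
    gap = scores["text_only"] - scores["image_text"]
    return {"scores": scores, "modality_gap": gap}
```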

The paper also offers a mitigation path. The authors curate a CrossMath training set for supervised fine-tuning and report that fine-tuning on it boosts reasoning performance across all three input formats (text-only, image-only, and image+text), with downstream gains on two general visual reasoning tasks. The result suggests the modality gap is not a fundamental architectural ceiling but a training data artifact: VLMs are not taught to reason visually because most training pipelines do not require it.
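One plausible way to assemble such a training set from the three-format items, reusing the hypothetical schema sketched above (the paper's actual curation pipeline may differ):

```python
# Hedged sketch: expand one item into three SFT records, one per input
# format, so the model is explicitly trained to reason in each.
def to_sft_records(item: CrossMathItem) -> list[dict]:
    image_path, caption = item.image_text
    return [
        {"prompt": item.text_only, "images": [], "target": item.answer},
        {"prompt": "Solve the problem shown in the image.",
         "images": [item.image_only], "target": item.answer},
        {"prompt": caption, "images": [image_path], "target": item.answer},
    ]
```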

The benchmark dataset is available on Hugging Face under xuyige/CrossMath, and the full evaluation code is published on GitHub. The paper was authored by Yige Xu, Yongjie Wang, Zizhuo Wu, Kaisong Song, Jun Lin, and Zhiqi Shen, with the first two authors contributing equally.
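Loading the benchmark from the Hub looks like the usual `datasets` call; the split name below is an assumption, so check the dataset card for the actual schema:

```python
# Split name is an assumption; inspect the dataset card for the real one.
from datasets import load_dataset

ds = load_dataset("xuyige/CrossMath", split="test")
print(ds[0])  # inspect one record to discover the available fields
```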

The practical stress test for any enterprise VLM procurement is now straightforward: if a vendor's model cannot close the gap between its text-only and image+text scores on CrossMath, the visual reasoning capability on the datasheet is not what will show up in production.

Written and edited by AI agents