MathSight

A Benchmark Exploring Have Vision-Language Models Really Seen in University-Level Mathematical Reasoning?

Yuandong Wang, Yao Cui, Yuxin Zhao, Zhen Yang*, Yangfu Zhu, Zhenzhou Shao
*Corresponding Author

Abstract

Recent advances in Vision-Language Models (VLMs) have achieved impressive progress in multimodal mathematical reasoning. Yet, how much visual information truly contributes to reasoning remains unclear. Existing benchmarks report strong overall performance but seldom isolate the role of the image modality, leaving open whether VLMs genuinely leverage visual understanding or merely depend on linguistic priors.

To address this, we present MathSight, a university-level multimodal mathematical reasoning benchmark designed to disentangle and quantify the effect of visual input. Each problem includes multiple visual variants—original, hand-drawn, photo-captured—and a text-only condition for controlled comparison.

Experiments on state-of-the-art VLMs reveal a consistent trend: the contribution of visual information diminishes with increasing problem difficulty. Remarkably, Qwen3-VL without any image input surpasses both its multimodal variants and GPT-5, underscoring the need for benchmarks like MathSight to advance genuine vision-grounded reasoning in future models.

🔔 Updates

  • 2025-11-15: Initial MathSight dataset and code released.

Introduction

MathSight contains 2,048 university-level math problems, including 661 multimodal items with diagrams and 1,387 text-only items. It is designed to probe whether modern vision-language models truly benefit from visual input in mathematical reasoning.

Multi-View Visual Reasoning

Each multimodal problem is paired with multiple visual variants—original, hand-drawn, photo-captured, and a text-only condition—to disentangle the contribution of vision from linguistic priors.

Challenging University-Level Coverage

Problems span six core subjects and are predominantly graduate-level in the multimodal track, bridging traditional text-only math benchmarks and real exam-style visual reasoning.

Vision vs. Text Analysis

MathSight enables controlled comparisons of state-of-the-art VLMs with and without images, revealing when models truly rely on visual understanding and when text alone is sufficient.
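To make this controlled comparison concrete, the sketch below scores a model on each problem under every visual condition, including the text-only one. It is a minimal illustration only: the JSONL file name, the `question` / `answer` / `images` fields, and the `query_model` helper are assumptions made for the example, not the released MathSight format or API.

```python
import json

# Visual conditions evaluated for every multimodal problem; "text_only"
# corresponds to the no-image control condition.
VARIANTS = ["original", "hand_drawn", "photo", "text_only"]


def load_problems(path="mathsight.jsonl"):
    """Load MathSight-style problems from a JSONL file (assumed layout)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]


def query_model(question, image_path=None):
    """Placeholder for a VLM/LLM call; plug in your own model client here."""
    raise NotImplementedError


def evaluate(problems):
    """Accuracy under each visual condition (simple exact-match scoring)."""
    correct = {v: 0 for v in VARIANTS}
    total = {v: 0 for v in VARIANTS}
    for prob in problems:
        for variant in VARIANTS:
            image = None
            if variant != "text_only":
                image = prob.get("images", {}).get(variant)
                if image is None:  # text-only item: no visual condition to score
                    continue
            prediction = query_model(prob["question"], image)
            total[variant] += 1
            correct[variant] += int(prediction.strip() == str(prob["answer"]).strip())
    return {v: correct[v] / total[v] for v in VARIANTS if total[v]}
```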

Why MathSight?

Most multimodal math benchmarks show high overall accuracy but rarely isolate how much models truly rely on visual input. MathSight is designed to answer a simple question: have VLMs really seen?

MathSight motivation example

Motivation behind MathSight: existing benchmarks vs. our design with richer visual variants and harder problems.

Example Problems

A few representative MathSight problems illustrating visually grounded mathematical reasoning.

Constraint normal cone illustration
Illustration of a point at the intersection of two constraints and its associated normal cone.
Vector projection onto a subspace
Geometric visualization of projecting a vector onto a subspace and its orthogonal complement.
Solid geometry shapes
Example of geometric solids used in optimization and volume-related problems.

Dataset Statistics

2048 Total Problems
661 Multimodal items
632 × 3 Images
6 Subjects
1713 / 335 Non-proving questions / proving questions
1668 / 380 Graduate-level / Undergraduate-level
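For readers who want to re-derive these counts from a local copy, a minimal tallying sketch is given below. The file name and the `images`, `level`, and `is_proof` fields are assumptions about the record layout, not a documented schema.

```python
import json
from collections import Counter


def summarize(path="mathsight.jsonl"):
    """Tally MathSight-style records by modality, level, and question type.

    Field names here are assumptions for illustration, not an official schema.
    """
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f]

    counts = Counter(total=len(records))
    for r in records:
        counts["multimodal" if r.get("images") else "text_only"] += 1
        counts[r.get("level", "unknown")] += 1  # e.g. graduate / undergraduate
        counts["proving" if r.get("is_proof") else "non_proving"] += 1
    return counts


# Expected, per the statistics above: 2048 total, 661 multimodal / 1387 text-only,
# 1713 non-proving / 335 proving, 1668 graduate-level / 380 undergraduate-level.
```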

Main Results

We highlight key evaluation results of state-of-the-art Vision-Language Models (VLMs) and text-only LMs on the MathSight benchmark.

Key Findings

  • Strong text-only LMs can match or surpass VLMs on MathSight, indicating heavy reliance on linguistic priors.
  • Visual robustness is limited — performance varies across original, hand-drawn, and photo-captured variants.
  • Image size and visual style both matter, but the benefit of visual input shrinks as problem difficulty increases.

The figures below summarize overall scores, image-size sensitivity, the multimodal vs. text-only comparison, and answer-consistency patterns; minimal computation sketches for the group-CV and answer-consistency metrics follow the figure captions.

Overall evaluation results of Vision-Language Models on the MathSight benchmark.
Overall accuracy of closed-source and open-source VLMs on MathSight under original (V.Orig), hand-drawn (V.Draw), and photo-captured (V.Photo) visual variants.
Box plots of group coefficient of variation for GPT-4o variants.
Box plots of group coefficient of variation (group_CV) for GPT-4o variants. Each box covers the central 95% region of group_CV values, the horizontal line marks the median group_CV, and points outside indicate rare outliers.
Comparison between text-only and different multimodal inputs.
Comparison between text-only (V.w/o image) and multimodal inputs for Qwen3 models. Qwen3-VL without image input even surpasses GPT-5 on MathSight, highlighting the strength of language-only reasoning.
Answer-consistency patterns across three visual variants.
Distribution of answer-consistency patterns across original (A), photo-captured (B), and hand-drawn (C) variants for GPT-5 and Qwen3-VL-235B-A22B, revealing how visual style changes affect prediction agreement.
Evaluation results of VLMs on images of different sizes.
Evaluation results of VLMs on images of different sizes. Shrinking image resolution only mildly degrades accuracy for most models, suggesting that visual information is often under-utilized compared with text.
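For reference, the group coefficient of variation shown in the box plots is the standard ratio of standard deviation to mean computed within each group of scores. The sketch below assumes each group is a set of repeated accuracy scores (for example, the same problems scored under several runs or variants); the exact grouping used for the figure is not specified here.

```python
import numpy as np


def group_cv(scores_per_group):
    """Coefficient of variation (sample std / mean) for each group of scores.

    `scores_per_group` maps a group id to repeated scores; the grouping used
    for the paper's box plots is an assumption in this sketch.
    """
    cvs = {}
    for group, scores in scores_per_group.items():
        arr = np.asarray(scores, dtype=float)
        mean = arr.mean()
        cvs[group] = float(arr.std(ddof=1) / mean) if mean != 0 else float("nan")
    return cvs


# Toy example with three hypothetical groups of repeated accuracy scores.
print(group_cv({"g1": [0.52, 0.48, 0.50], "g2": [0.30, 0.45, 0.25], "g3": [0.71, 0.70, 0.69]}))
```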
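Similarly, the answer-consistency distribution can be summarized by classifying, per problem, how predictions agree across the three visual variants. The pattern labels below are illustrative; the paper's exact categorization may differ.

```python
from collections import Counter


def consistency_patterns(preds_original, preds_photo, preds_drawn):
    """Classify per-problem agreement across the three visual variants (A/B/C).

    The three arguments are aligned lists of model answers for the original,
    photo-captured, and hand-drawn variants of the same problems.
    """
    patterns = Counter()
    for a, b, c in zip(preds_original, preds_photo, preds_drawn):
        if a == b == c:
            patterns["all three agree"] += 1
        elif a == b or a == c or b == c:
            patterns["exactly two agree"] += 1
        else:
            patterns["all differ"] += 1
    return patterns


# Toy example: answers to four problems under the three variants.
print(consistency_patterns(["1", "2", "3", "4"], ["1", "2", "5", "6"], ["1", "7", "3", "8"]))
```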

Comparison with Existing Benchmarks

MathSight complements existing mathematical and visual reasoning benchmarks with a focus on real exam-style problems grounded in images.

Table 1: Existing benchmarks. “Visual Variant” indicates whether the visual input is provided in multiple variants, “Grad. Level” indicates that the benchmark contains graduate-level problems, and “Pro.Q” stands for proving questions.

Benchmark | Grad. Level | Pro.Q | Visual Variant (Original / Hand-drawn / Photo-captured) | Image Size
TheoremQA
MathVista
Scibench
QRData
MATH-Vision
MMMUmath
U-Math
Dynamath
MathCheck
MathSight (our work)

Problem Explorer

Browse a small subset of MathSight problems by difficulty, subject, and modality.


Citation

@article{2025mathsight,
  title   = {MathSight: A Benchmark Exploring Have Vision-Language Models Really Seen in University-Level Mathematical Reasoning?},
  author  = {Yuandong Wang and Cui Yao and Yuxin Zhao and Zhen Yang and Yangfu Zhu and Zhenzhou Shao},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2025}
}