MathSight

A Benchmark Exploring Have Vision-Language Models Really Seen in University-Level Mathematical Reasoning?

Yuandong Wang, Yao Cui, Yuxin Zhao, Zhen Yang*, Yangfu Zhu, Zhenzhou Shao
*Corresponding Author

Abstract

Recent advances in Vision-Language Models (VLMs) have achieved impressive progress in multimodal mathematical reasoning. Yet, how much visual information truly contributes to reasoning remains unclear. Existing benchmarks report strong overall performance but seldom isolate the role of the image modality, leaving open whether VLMs genuinely leverage visual understanding or merely depend on linguistic priors.

To address this, we present MathSight, a university-level multimodal mathematical reasoning benchmark designed to disentangle and quantify the effect of visual input. Each problem includes multiple visual variants—original, hand-drawn, photo-captured—and a text-only condition for controlled comparison.

Experiments on state-of-the-art VLMs reveal a consistent trend: the contribution of visual information diminishes with increasing problem difficulty. Remarkably, Qwen3-VL without any image input surpasses both its multimodal variants and GPT-5, underscoring the need for benchmarks like MathSight to advance genuine vision-grounded reasoning in future models.

🔔 Updates

  • 2025-11-15: Initial MathSight dataset and code released.

Introduction

MathSight contains 2,048 university-level math problems, including 661 multimodal items with diagrams and 1,387 text-only items. It is designed to probe whether modern vision-language models truly benefit from visual input in mathematical reasoning.

Multi-View Visual Reasoning

Each multimodal problem is paired with multiple visual variants—original, hand-drawn, photo-captured, and a text-only condition—to disentangle the contribution of vision from linguistic priors.

Challenging University-Level Coverage

Problems span six core subjects and are predominantly graduate-level in the multimodal track, bridging traditional text-only math benchmarks and real exam-style visual reasoning.

Vision vs. Text Analysis

MathSight enables controlled comparisons of state-of-the-art VLMs with and without images, revealing when models truly rely on visual understanding and when text alone is sufficient.
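To make this controlled comparison concrete, the sketch below scores a model on each problem under every visual condition, including the text-only one. It is a minimal illustration only: the JSONL file name, the `question` / `answer` / `images` fields, and the `query_model` helper are assumptions made for the example, not the released MathSight format or API.

```python
import json

# Visual conditions evaluated for every multimodal problem; "text_only"
# corresponds to the no-image control condition.
VARIANTS = ["original", "hand_drawn", "photo", "text_only"]


def load_problems(path="mathsight.jsonl"):
    """Load MathSight-style problems from a JSONL file (assumed layout)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]


def query_model(question, image_path=None):
    """Placeholder for a VLM/LLM call; plug in your own model client here."""
    raise NotImplementedError


def evaluate(problems):
    """Accuracy under each visual condition (simple exact-match scoring)."""
    correct = {v: 0 for v in VARIANTS}
    total = {v: 0 for v in VARIANTS}
    for prob in problems:
        for variant in VARIANTS:
            image = None
            if variant != "text_only":
                image = prob.get("images", {}).get(variant)
                if image is None:  # text-only item: no visual condition to score
                    continue
            prediction = query_model(prob["question"], image)
            total[variant] += 1
            correct[variant] += int(prediction.strip() == str(prob["answer"]).strip())
    return {v: correct[v] / total[v] for v in VARIANTS if total[v]}
```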

Why MathSight?

Most multimodal math benchmarks show high overall accuracy but rarely isolate how much models truly rely on visual input. MathSight is designed to answer a simple question: have VLMs really seen?

MathSight motivation example

Motivation behind MathSight: existing benchmarks vs. our design with richer visual variants and harder problems.

Example Problems

A few representative MathSight problems illustrating visually grounded mathematical reasoning.

Constraint normal cone illustration
Illustration of a point at the intersection of two constraints and its associated normal cone.
Vector projection onto a subspace
Geometric visualization of projecting a vector onto a subspace and its orthogonal complement.
Solid geometry shapes
Example of geometric solids used in optimization and volume-related problems.

Dataset Statistics

2048 Total Problems
661 Multimodal items
632 × 3 Images
6 Subjects
1713 / 335 Non-proving questions / proving questions
1668 / 380 Graduate-level / Undergraduate-level
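For readers who want to re-derive these counts from a local copy, a minimal tallying sketch is given below. The file name and the `images`, `level`, and `is_proof` fields are assumptions about the record layout, not a documented schema.

```python
import json
from collections import Counter


def summarize(path="mathsight.jsonl"):
    """Tally MathSight-style records by modality, level, and question type.

    Field names here are assumptions for illustration, not an official schema.
    """
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f]

    counts = Counter(total=len(records))
    for r in records:
        counts["multimodal" if r.get("images") else "text_only"] += 1
        counts[r.get("level", "unknown")] += 1  # e.g. graduate / undergraduate
        counts["proving" if r.get("is_proof") else "non_proving"] += 1
    return counts


# Expected, per the statistics above: 2048 total, 661 multimodal / 1387 text-only,
# 1713 non-proving / 335 proving, 1668 graduate-level / 380 undergraduate-level.
```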

Main Results

We highlight key evaluation results of state-of-the-art Vision-Language Models (VLMs) and text-only LMs on the MathSight benchmark.

Key Findings

  • Strong text-only LMs can match or surpass VLMs on MathSight, indicating heavy reliance on linguistic priors.
  • Visual robustness is limited — performance varies across original, hand-drawn, and photo-captured variants.
  • Image size and visual style both matter, but the benefit of visual input shrinks as problem difficulty increases.

The figures below summarize overall scores, image-size sensitivity, the multimodal vs. text-only comparison, and answer-consistency patterns; minimal computation sketches for the group-CV and answer-consistency metrics follow the figure captions.

Overall evaluation results of Vision-Language Models on the MathSight benchmark.
Overall accuracy of closed-source and open-source VLMs on MathSight under original (V.Orig), hand-drawn (V.Draw), and photo-captured (V.Photo) visual variants.
Box plots of group coefficient of variation for GPT-4o variants.
Box plots of group coefficient of variation (group_CV) for GPT-4o variants. Each box covers the central 95% region of group_CV values, the horizontal line marks the median group_CV, and points outside indicate rare outliers.
Comparison between text-only and different multimodal inputs.
Comparison between text-only (V.w/o image) and multimodal inputs for Qwen3 models. Qwen3-VL without image input even surpasses GPT-5 on MathSight, highlighting the strength of language-only reasoning.
Answer-consistency patterns across three visual variants.
Distribution of answer-consistency patterns across original (A), photo-captured (B), and hand-drawn (C) variants for GPT-5 and Qwen3-VL-235B-A22B, revealing how visual style changes affect prediction agreement.
Evaluation results of VLMs on images of different sizes.
Evaluation results of VLMs on images of different sizes. Shrinking image resolution only mildly degrades accuracy for most models, suggesting that visual information is often under-utilized compared with text.
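For reference, the group coefficient of variation shown in the box plots is the standard ratio of standard deviation to mean computed within each group of scores. The sketch below assumes each group is a set of repeated accuracy scores (for example, the same problems scored under several runs or variants); the exact grouping used for the figure is not specified here.

```python
import numpy as np


def group_cv(scores_per_group):
    """Coefficient of variation (sample std / mean) for each group of scores.

    `scores_per_group` maps a group id to repeated scores; the grouping used
    for the paper's box plots is an assumption in this sketch.
    """
    cvs = {}
    for group, scores in scores_per_group.items():
        arr = np.asarray(scores, dtype=float)
        mean = arr.mean()
        cvs[group] = float(arr.std(ddof=1) / mean) if mean != 0 else float("nan")
    return cvs


# Toy example with three hypothetical groups of repeated accuracy scores.
print(group_cv({"g1": [0.52, 0.48, 0.50], "g2": [0.30, 0.45, 0.25], "g3": [0.71, 0.70, 0.69]}))
```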
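Similarly, the answer-consistency distribution can be summarized by classifying, per problem, how predictions agree across the three visual variants. The pattern labels below are illustrative; the paper's exact categorization may differ.

```python
from collections import Counter


def consistency_patterns(preds_original, preds_photo, preds_drawn):
    """Classify per-problem agreement across the three visual variants (A/B/C).

    The three arguments are aligned lists of model answers for the original,
    photo-captured, and hand-drawn variants of the same problems.
    """
    patterns = Counter()
    for a, b, c in zip(preds_original, preds_photo, preds_drawn):
        if a == b == c:
            patterns["all three agree"] += 1
        elif a == b or a == c or b == c:
            patterns["exactly two agree"] += 1
        else:
            patterns["all differ"] += 1
    return patterns


# Toy example: answers to four problems under the three variants.
print(consistency_patterns(["1", "2", "3", "4"], ["1", "2", "5", "6"], ["1", "7", "3", "8"]))
```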

Comparison with Existing Benchmarks

MathSight complements existing mathematical and visual reasoning benchmarks with a focus on real exam-style problems grounded in images.

Table 1: Existing benchmarks. “Visual Variant” indicates whether the visual input is provided in multiple variants, “Grad. Level” indicates that the benchmark contains graduate-level problems, and “Pro.Q” stands for proving questions.

Benchmark | Grad. Level | Pro.Q | Visual Variant (Original / Hand-drawn / Photo-captured) | Image Size
TheoremQA
MathVista
Scibench
QRData
MATH-Vision
MMMUmath
U-Math
Dynamath
MathCheck
MathSight (our work)

Problem Explorer

Browse a small subset of MathSight problems by difficulty, subject, and modality.


Citation

@article{2025mathsight,
  title   = {MathSight: A Benchmark Exploring Have Vision-Language Models Really Seen in University-Level Mathematical Reasoning?},
  author  = {Yuandong Wang and Cui Yao and Yuxin Zhao and Zhen Yang and Yangfu Zhu and Zhenzhou Shao},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2025}
}