Main Results
We highlight key evaluation results for state-of-the-art Vision-Language Models (VLMs) and text-only LMs on the MathSight benchmark.
Key Findings
- Strong text-only LMs can match or surpass VLMs on MathSight, indicating heavy reliance on linguistic priors.
- Visual robustness is limited: performance varies across the original, hand-drawn, and photo-captured variants of the same problem.
- Image size and visual style both matter, but the benefit of visual input shrinks as problem difficulty increases.
The figures on the right summarize overall scores, image-size sensitivity, the multimodal vs. text-only comparison, and answer-consistency patterns.
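The section does not spell out how per-variant accuracy or answer consistency are computed. As a hypothetical sketch (the record layout, variant names as dictionary keys, and toy data below are illustrative assumptions, not the benchmark's actual evaluation code), the two quantities could be derived like this:

```python
from collections import defaultdict

# Hypothetical records: (model, variant, problem_id, predicted, gold).
# The three variant names mirror the conditions described above.
records = [
    ("vlm-a", "original",       "p1", "42", "42"),
    ("vlm-a", "hand-drawn",     "p1", "42", "42"),
    ("vlm-a", "photo-captured", "p1", "17", "42"),
    ("vlm-a", "original",       "p2", "7",  "7"),
    ("vlm-a", "hand-drawn",     "p2", "7",  "7"),
    ("vlm-a", "photo-captured", "p2", "7",  "7"),
]

def accuracy_by_variant(rows):
    """Fraction of correct answers within each visual variant."""
    hits, totals = defaultdict(int), defaultdict(int)
    for _, variant, _, pred, gold in rows:
        totals[variant] += 1
        hits[variant] += pred == gold
    return {v: hits[v] / totals[v] for v in totals}

def consistency_rate(rows):
    """Fraction of problems where a model gives the same answer
    across all visual variants, regardless of correctness."""
    answers = defaultdict(set)
    for model, _, pid, pred, _ in rows:
        answers[(model, pid)].add(pred)
    return sum(len(a) == 1 for a in answers.values()) / len(answers)

print(accuracy_by_variant(records))  # e.g. photo-captured accuracy drops to 0.5
print(consistency_rate(records))     # 0.5: p2 is consistent, p1 is not
```

A gap between text-only and visual-variant accuracy, combined with a low consistency rate, would be one concrete way to quantify the "reliance on linguistic priors" finding above.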