GeoSym127K: Scalable Symbolically-verifiable Synthesis for Multimodal Geometric Reasoning
Quick Take
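GeoSym127K is a geometry dataset built with the GeoSym Engine, an automated and scalable neuro-symbolic framework that synthesizes diagrams and questions with exact symbolic ground truths for training multimodal models.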
Key Points
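- The GeoSym Engine combines a type-conditional grammar with an analytic SymGT Solver to derive exact symbolic ground truths, and pairs them with a rendering pipeline for high-precision diagrams.
- The dataset contains 51K high-resolution images, 127K questions with symbolic ground truths, and 55K answer-verified CoT QA pairs; GeoSym-Bench adds 511 expert-curated samples for evaluation.
- After SFT on GeoSym, Qwen3-VL-8B gains an absolute +22.21% on the MathVerse Vision-Only subset and reaches 61.52% (+6.19%) on WeMath.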
Authors: Jinhao Jing, Zheng Ma, Jinwei Liang, Qiannian Zhao, Shawn Chen, Jing Yang, Por Lip Yee, Prayag Tiwari, Jingjing Bai, Benyou Wang, Lewei Lu, Zhan Su
Abstract: Large Multimodal Models (LMMs) often struggle with geometric reasoning due to visual hallucinations and a lack of mathematically precise Chain-of-Thought (CoT) data. To address this, we propose the GeoSym Engine, an automated and scalable neuro-symbolic framework. By leveraging a type-conditional grammar and an analytic SymGT Solver, it derives exact symbolic ground truths and seamlessly integrates with a robust rendering pipeline to produce high-precision geometric diagrams. Using this engine, we construct GeoSym127K, a difficulty-stratified dataset featuring 51K high-resolution images, 127K questions with symbolic ground truths, and 55K answer-verified CoT QA pairs. We also introduce GeoSym-Bench, an expert-curated suite of 511 complex samples for rigorous evaluation. Through extensive supervised fine-tuning (SFT), we demonstrate that GeoSym drives concentrated improvements specifically on diagram-dependent and multi-step geometry tasks. Our Qwen3-VL-8B model gains an absolute +22.21% on the MathVerse Vision-Only subset and reaches 61.52% (+6.19% improvement) on WeMath, mitigating long-horizon logic fragmentation and outperforming advanced closed-source models like Doubao-1.8. Furthermore, applying Reinforcement Learning with Verifiable Rewards (RLVR) via GRPO reveals that initializing from structural SFT checkpoints substantially elevates the performance ceiling over zero-shot RL. Driven by deterministic exact-match signals, this showcases the robust scaling potential of our verifiable reasoning synthesis. Datasets and code are available at this https URL and this https URL.
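The abstract does not show the reward function or training code, so the following is only a minimal sketch, under stated assumptions, of how a deterministic exact-match signal could feed the group-relative advantage commonly described for GRPO. The function names, the restriction to rational-number answers, and the `eps` stabilizer are illustrative assumptions, not the authors' implementation.

```python
from fractions import Fraction
from statistics import mean, pstdev


def normalize(answer: str) -> str:
    """Canonicalize an answer string (hypothetical helper).

    Rational answers are reduced to an exact fraction so that
    "0.5", "1/2" and "2/4" compare equal without floating-point tolerance.
    """
    text = answer.strip().replace(" ", "")
    try:
        return str(Fraction(text))
    except (ValueError, ZeroDivisionError):
        # Assumption: anything that is not a rational number is compared as text.
        return text.lower()


def exact_match_reward(prediction: str, ground_truth: str) -> float:
    """Return 1.0 on an exact match after normalization, else 0.0."""
    return 1.0 if normalize(prediction) == normalize(ground_truth) else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled answers.

    Each reward is centered on the group mean and scaled by the group
    standard deviation; eps (an assumption) avoids division by zero
    when every sample receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers to one question whose ground truth is 1/2.
    samples = ["0.5", "2/4", "0.6", "1/3"]
    rewards = [exact_match_reward(s, "1/2") for s in samples]
    print(rewards)  # [1.0, 1.0, 0.0, 0.0]
    print(group_relative_advantages(rewards))  # roughly +1, +1, -1, -1
```

Because the reward is binary and deterministic, a group in which every sample is right (or every sample is wrong) yields zero advantage for all samples in this sketch, so such a group contributes no learning signal.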
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as: arXiv:2605.16371 [cs.CV] (or arXiv:2605.16371v1 [cs.CV] for this version)
DOI: https://doi.org/10.48550/arXiv.2605.16371 (arXiv-issued DOI via DataCite, pending registration)
Submission history
From: Jinhao Jing
[v1] Sun, 10 May 2026 13:13:47 UTC (6,173 KB)
Originally published at arxiv.org
