An LMM for Precisely Grounding Elements in Documents
Quick Answer
PreciseDoc is a new Large Multimodal Model (LMM) designed for accurate visual grounding in text-rich documents, enhancing localization capabilities through synthetic training data and joint reinforcement learning.
Quick Take
PreciseDoc is a new Large Multimodal Model (LMM) designed for accurate visual grounding in text-rich documents, enhancing localization capabilities through synthetic training data and joint reinforcement learning. Evaluations show improved performance in document spatial grounding and understanding tasks, addressing limitations of existing models.
Key Points
- PreciseDoc improves grounding precision in document images for better reasoning.
- Synthetic training data includes hand-filled documents with fine-grained coordinates.
- Joint reinforcement learning enhances the model's grounded reasoning capabilities.
- Comprehensive evaluations demonstrate advantages over existing grounding methods.
- The model can locate critical elements like personal information in CVs.
Article Content
From source RSS / original summary: arXiv:2606.24118v1. Abstract: Visual grounding in documents is a crucial ability for Large Multimodal Models (LMMs) in areas such as document understanding, deep research and document error detection. However, existing approaches exhibit poor grounding precision in text-rich document images, often failing to accurately locate the critical document elements needed for reliable reasoning.
To address this gap, we introduce PreciseDoc, an LMM that is specifically designed for precise element grounding and that can be further optimized for Document VQA tasks. Specifically, to enhance the basic localization capability, we construct challenging training data with two pipelines capable of mass-producing high-quality documents paired with metadata of fine-grained coordinates, including synthetic hand-filled documents with camera effects.
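The abstract does not describe how the two data pipelines are built, so the following is only a minimal sketch of the general idea: lay out form fields, record a bounding box for each filled-in value, and transform the coordinates together with the page when a camera-like distortion is applied. The `ElementAnnotation` schema, the fixed character width, and the small rotation standing in for a "camera effect" are all assumptions made for illustration, not details from the paper.

```python
import json
import math
import random
from dataclasses import dataclass, asdict


@dataclass
class ElementAnnotation:
    """One groundable element: its text plus a pixel-space box (hypothetical schema)."""
    label: str
    text: str
    box: tuple  # (x_min, y_min, x_max, y_max)


def layout_form(fields, page_width=800, margin=40, line_height=48,
                char_width=12, seed=0):
    """Place each (label, value) pair on its own row and record the value's box.

    Character width and line height are fixed constants here; a real renderer
    would measure the font instead.
    """
    rng = random.Random(seed)
    annotations = []
    for row, (label, value) in enumerate(fields):
        # Jitter the start position slightly to imitate handwriting drift.
        x0 = margin + len(label) * char_width + 20 + rng.randint(-4, 4)
        y0 = margin + row * line_height + rng.randint(-3, 3)
        x1 = min(x0 + len(value) * char_width, page_width - margin)
        y1 = y0 + line_height - 16
        annotations.append(ElementAnnotation(label, value, (x0, y0, x1, y1)))
    return annotations


def rotate_box(box, angle_deg, center):
    """Rotate a box about `center` and return the axis-aligned box enclosing it.

    A stand-in for a camera effect: the image and its coordinates must be
    transformed together so the metadata stays aligned with the pixels.
    """
    cx, cy = center
    theta = math.radians(angle_deg)
    x0, y0, x1, y1 = box
    corners = [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    rotated = [
        (cx + (x - cx) * math.cos(theta) - (y - cy) * math.sin(theta),
         cy + (x - cx) * math.sin(theta) + (y - cy) * math.cos(theta))
        for x, y in corners
    ]
    xs, ys = zip(*rotated)
    return (min(xs), min(ys), max(xs), max(ys))


# Example values are made up for the sketch.
fields = [("Name:", "Ada Lovelace"), ("Phone:", "555-0100")]
annotations = layout_form(fields)
skewed = [rotate_box(a.box, 2.0, center=(400, 300)) for a in annotations]
record = {"elements": [asdict(a) for a in annotations], "skewed_boxes": skewed}
print(json.dumps(record, indent=2))
```

Because the layout is generated rather than annotated by hand, every value comes with exact coordinates for free, which is presumably what makes this kind of data cheap to mass-produce.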
Beyond straightforward localization of a single piece of text, the model develops more real-world functions, such as locating personal information in CVs. Furthermore, we introduce a training paradigm for visually grounded reasoning in which grounding and reasoning are supervised jointly with reinforcement learning to improve the contribution of the grounded evidence.
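The abstract does not state the reward used for joint supervision. One common way to score grounding and reasoning together, shown here purely as an illustrative sketch, is to blend the intersection over union (IoU) of the predicted and reference boxes with an answer-correctness term. The function names, the exact-match rule, and the `grounding_weight` parameter are hypothetical and not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix0 = max(box_a[0], box_b[0])
    iy0 = max(box_a[1], box_b[1])
    ix1 = min(box_a[2], box_b[2])
    iy1 = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def joint_reward(pred_box, gold_box, pred_answer, gold_answer,
                 grounding_weight=0.5):
    """Blend a grounding term (IoU) with an answer term (exact match).

    The weight and the case-insensitive exact-match rule are assumptions
    chosen for the sketch.
    """
    grounding = iou(pred_box, gold_box)
    matches = pred_answer.strip().lower() == gold_answer.strip().lower()
    answer = 1.0 if matches else 0.0
    return grounding_weight * grounding + (1.0 - grounding_weight) * answer


# A correct answer backed by a misplaced box earns only the answer share.
print(joint_reward((0, 0, 10, 10), (20, 20, 30, 30), "Ada", "ada"))
```

Under a reward of this shape, a rollout that answers correctly while pointing at the wrong region scores lower than one that both answers correctly and localizes the evidence, which is one way to make the grounded evidence matter during training.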
A comprehensive evaluation on various benchmarks demonstrates the advantage of the proposed data and methods in document spatial grounding and document understanding.