SD-GRPO: Verifiable Segment Decomposition for Long-Form Vision-Language Generation
Quick Answer
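Segment-Decomposed GRPO (SD-GRPO) replaces GRPO's single scalar advantage with a vector of per-segment advantages, obtained by z-normalizing verifiable per-segment rewards across the rollout group. On long-form vision-language generation it outperforms the GRPO baseline, with larger gains at higher segment counts.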
Quick Take
The evaluation covers three settings ordered by increasing semantic entanglement across segments: multi-panel dense captioning built from DOCCI, multi-chart long-form VQA built from MultiChartQA, and scientific figure captioning on MMSci. When segments are independent, the gains over GRPO grow with segment count. When subfigure captions share context, blending holistic and per-segment rewards improves on either reward alone.
Key Points
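- SD-GRPO replaces the single scalar advantage with a vector of per-segment advantages.
- Outperforms the GRPO baseline on a controlled dense-captioning task with semantically independent segments, with larger gains at higher segment counts.
- Shows, theoretically and empirically, that rollout-level rewards in long-form VQA suffer from cross-segment credit misattribution that scales with output length.
- Blending holistic and per-segment rewards improves on both when segments are semantically entangled, as in scientific figure captioning.
- Integrates into existing GRPO frameworks, demonstrated with Dr. GRPO, with minimal implementation overhead.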
Article Content
From source RSS / original summary: arXiv:2606.09871v1, Announce Type: new.
Abstract: Group Relative Policy Optimization (GRPO) and its variants, originally developed for Large Language Models (LLMs), have recently been applied to Multimodal LLMs and produced strong results. However, their coarse-grained holistic credit assignment from a single scalar advantage underfits vision-language (VL) tasks, where outputs are often long-form responses grounded in semantically rich images.
To address this limitation, we exploit a structured signal that single-scalar formulations discard: the natural segmentation of long-form VL outputs. Concretely, we propose Segment-Decomposed GRPO (SD-GRPO), which z-normalizes verifiable per-segment rewards across the rollout group, yielding a vector of per-segment advantages in place of a single scalar.
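The per-segment normalization described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the `eps` term, the use of the population standard deviation, and the toy reward values are all assumptions made for the example. The scalar baseline here averages the segment rewards of each rollout before normalizing, which is one plausible reading of a "single scalar advantage".

```python
from statistics import mean, pstdev


def segment_advantages(rewards, eps=1e-6):
    """Z-normalize each segment's reward across the rollout group.

    rewards[i][k] is the reward of segment k in rollout i.
    Returns a list of the same shape holding per-segment advantages.
    """
    num_segments = len(rewards[0])
    # Mean and (population) std of every segment column over the group.
    stats = []
    for k in range(num_segments):
        column = [row[k] for row in rewards]
        stats.append((mean(column), pstdev(column)))
    return [
        [(r - mu) / (sd + eps) for r, (mu, sd) in zip(row, stats)]
        for row in rewards
    ]


def scalar_advantages(rewards, eps=1e-6):
    """GRPO-style baseline: one scalar advantage per rollout."""
    totals = [mean(row) for row in rewards]
    mu, sd = mean(totals), pstdev(totals)
    return [(t - mu) / (sd + eps) for t in totals]


# Toy group of 4 rollouts with 3 segments each (made-up 0/1 rewards).
rewards = [
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
]
seg_adv = segment_advantages(rewards)
scalar_adv = scalar_advantages(rewards)
```

In this toy group the first three rollouts receive the same positive scalar advantage even though each contains one failed segment, while the per-segment version assigns that failed segment a negative advantage.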
We evaluate SD-GRPO across three settings spanning controlled and real-world long-form VL generation, organized by increasing semantic entanglement across segments. On a controlled multi-panel dense-captioning task constructed from DOCCI, where segments are semantically independent, SD-GRPO consistently outperforms the GRPO baseline, with larger gains at higher segment counts.
Extending to a controlled multi-chart long-form VQA task constructed from MultiChartQA, we show both theoretically and empirically that rollout-level rewards suffer from cross-segment credit misattribution that scales with output length.
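A toy enumeration can give some intuition for why a rollout-level signal misattributes credit more often as outputs get longer. It is not the paper's theoretical analysis: it assumes binary segment rewards, a fixed hypothetical baseline of 0.5 in place of a group mean, and a scalar advantage computed from the mean segment reward. A segment counts as misattributed when its own advantage has the opposite sign to the scalar advantage its tokens would receive.

```python
from itertools import product


def mismatch_rate(num_segments, baseline=0.5):
    """Fraction of segments whose own advantage disagrees in sign with
    the rollout-level scalar advantage, over all binary reward vectors."""
    mismatched = 0
    total = 0
    for rewards in product((0.0, 1.0), repeat=num_segments):
        scalar_adv = sum(rewards) / num_segments - baseline
        for r in rewards:
            segment_adv = r - baseline
            total += 1
            # Opposite signs: a good segment is penalized or a bad one rewarded.
            if scalar_adv * segment_adv < 0:
                mismatched += 1
    return mismatched / total


# Odd segment counts avoid ties where the scalar advantage is exactly zero.
rates = {k: mismatch_rate(k) for k in (1, 3, 5, 7)}
```

Under these assumptions the rate is 0 for a single segment, 0.25 for three segments, and keeps rising for five and seven.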
On a real-world scientific figure captioning task on the MMSci dataset, where subfigure captions share context across the figure, blending holistic and per-segment rewards further improves on both, suggesting per-segment normalization alone is insufficient when segments are semantically entangled. Finally, by integrating SD-GRPO into Dr. GRPO, we confirm that it can be applied to any GRPO framework with minimal implementation overhead to enhance long-form VL generation.
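One way to picture the blend is as a convex combination of a holistic advantage and the per-segment advantages. The sketch below is an assumption-laden illustration: the abstract does not say how the two signals are mixed, so the weight `alpha`, the choice to blend at the advantage level, and all function names are hypothetical. The `scale_by_std=False` path mimics the Dr. GRPO-style choice of centering rewards without dividing by the standard deviation.

```python
from statistics import mean, pstdev


def normalize(values, scale_by_std=True, eps=1e-6):
    """Center values across the group; optionally divide by the std."""
    mu = mean(values)
    if not scale_by_std:
        return [v - mu for v in values]
    sd = pstdev(values)
    return [(v - mu) / (sd + eps) for v in values]


def blended_advantages(segment_rewards, holistic_rewards,
                       alpha=0.5, scale_by_std=True):
    """Mix per-segment and holistic advantages with weight alpha.

    segment_rewards[i][k]: reward of segment k in rollout i.
    holistic_rewards[i]: one whole-response reward for rollout i.
    alpha=1 gives pure per-segment credit, alpha=0 pure holistic credit.
    """
    num_rollouts = len(segment_rewards)
    num_segments = len(segment_rewards[0])
    holistic_adv = normalize(holistic_rewards, scale_by_std)
    per_segment_adv = [
        normalize([row[k] for row in segment_rewards], scale_by_std)
        for k in range(num_segments)
    ]
    return [
        [alpha * per_segment_adv[k][i] + (1 - alpha) * holistic_adv[i]
         for k in range(num_segments)]
        for i in range(num_rollouts)
    ]


# Toy group: 2 rollouts, 2 segments, made-up rewards.
segment_rewards = [[1.0, 0.0], [0.0, 1.0]]
holistic_rewards = [0.8, 0.2]
only_segments = blended_advantages(segment_rewards, holistic_rewards,
                                   alpha=1.0, scale_by_std=False)
only_holistic = blended_advantages(segment_rewards, holistic_rewards,
                                   alpha=0.0, scale_by_std=False)
```

Because only the advantage computation changes, a sketch like this would slot in wherever an existing GRPO training loop turns rewards into advantages, with each token taking the advantage of the segment it belongs to.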