SD-GRPO: Verifiable Segment Decomposition for Long-Form Vision-Language Generation

arXiv cs.CV·Hyunwoong Kim, Seongeun Lee, Hannah Yun, Junhyun Park, Jonggwon Park

3d ago

·~2 min·6/10/2026·en·0

Quick Answer

Quick Take

The proposed Segment-Decomposed GRPO (SD-GRPO) enhances long-form vision-language generation by normalizing per-segment rewards, outperforming traditional GRPO in tasks with increasing semantic complexity. Evaluations on multi-panel dense-captioning and scientific figure captioning show significant performance gains, particularly with higher segment counts, confirming its effectiveness in multimodal applications.

Key Points

SD-GRPO replaces single scalar rewards with per-segment advantages for better performance.
Outperforms GRPO baseline in controlled dense-captioning tasks with independent segments.
Demonstrates cross-segment credit misattribution in long-form VQA tasks.
Combines holistic and per-segment rewards for improved results in semantically entangled tasks.
Integrates easily into existing GRPO frameworks with minimal overhead.

Paper Resources

Read Paperarxiv.org View PDFarxiv.org

Article Content

From source RSS / original summary

arXiv:2606. 09871v1 Announce Type: new Abstract: Group Relative Policy Optimization (GRPO) and its variants, originally developed for Large Language Models (LLMs), have recently been applied to Multimodal LLMs and produced strong results. However, their coarse-grained holistic credit assignment from a single scalar advantage underfits vision-language (VL) tasks, where outputs are often long-form responses grounded in semantically rich images.

To address this limitation, we exploit a structured signal that single-scalar formulations discard: the natural segmentation of long-form VL outputs. Concretely, we propose Segment-Decomposed GRPO (SD-GRPO), which z-normalizes verifiable per-segment rewards across the rollout group, yielding a vector of per-segment advantages in place of a single scalar.

We evaluate SD-GRPO across three settings spanning controlled and real-world long-form VL generation, organized by increasing semantic entanglement across segments. On a controlled multi-panel dense-captioning task constructed from DOCCI, where segments are semantically independent, SD-GRPO consistently outperforms the GRPO baseline, with larger gains at higher segment counts.

Extending to a controlled multi-chart long-form VQA task constructed from MultiChartQA, we show both theoretically and empirically that rollout-level rewards suffer from cross-segment credit misattribution that scales with output length.

On a real-world scientific figure captioning task on the MMSci dataset, where subfigure captions share context across the figure, blending holistic and per-segment rewards further improves on both, suggesting per-segment normalization alone is insufficient when segments are semantically entangled. Finally, by integrating SD-GRPO into Dr. GRPO, we confirm that it can be applied to any GRPO framework with minimal implementation overhead to enhance long-form VL generation.

Reader Mode unavailable (could not extract clean content).

Read on arxiv.org

Want this in your inbox every morning?

Daily brief at your local 8am — bilingual EN/中文, free.

Subscribe — it's free

More from arXiv cs.CV

See more →

arXiv cs.CV·Shahrzad Esmat, Chaunte W. Lacewell, Sameh Gobriel, Nilesh Jain, Ali Jannesari

1w ago

FeaturedOriginal

LLM-Guided ANN Index Optimization for Human-Object Interaction Retrieval

AI Summary

A phase-aware LLM agent optimizes human-object interaction retrieval, outperforming Optuna TPE by 33.3% and VDTuner by 34.2% on the HICO-DET benchmark. This method enhances throughput by 15.3x over UniIR and demonstrates strong transferability across vector database management systems.

#LLM #Agent #Inference #AI Startup