ProcessThinker: Enhancing Multi-modal Large Language Models Reasoning via Rollout-based Process Reward
Quick Answer
ProcessThinker enhances multi-modal reasoning in large language models like Qwen3-VL-8B-Instruct by implementing a rollout-based process reward system.
Quick Take
ProcessThinker is a post-training pipeline that adds step-level process rewards to reinforcement learning without training a separate process reward model, which would typically require large-scale chain-of-thought annotations. It consistently improves over the baseline Qwen3-VL-8B-Instruct on four video benchmarks: Video-MMMU, MMVU, VideoMathQA, and LongVideoBench. Rewarding each intermediate step by how often it leads to a correct final answer helps reduce inconsistent or self-contradictory reasoning.
Key Points
- ProcessThinker uses a novel rollout-based process reward for step-level supervision.
- It rewrites reasoning traces into a step-tagged format for efficient fine-tuning.
- The model shows consistent improvement across four challenging video benchmarks.
- Empirical success rates from sampled continuations provide dense credit assignment.
- This method helps reduce logical inconsistencies and contradictions in reasoning.
Article Content
From source RSS / original summary (arXiv:2606.11209v1)
Abstract: Visual question answering increasingly requires multi-step reasoning. Recent post-training with reinforcement learning under verifiable rewards (RLVR) and Group Relative Policy Optimization (GRPO) can improve multimodal reasoning, but most approaches rely on sparse outcome-only rewards. As a result, they struggle to tell whether an incorrect answer comes from a small mistake late in the reasoning or from an unhelpful trajectory from the start.
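To make the sparse-reward problem concrete, here is a minimal sketch of group-relative advantage normalization as it is commonly described for GRPO, applied to outcome-only rewards. The function name `group_relative_advantages`, the `eps` constant, and the choice of population standard deviation are assumptions for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the mean and std of its own group.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero; real implementations may differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one question, scored only on the final answer.
outcome_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(outcome_rewards)

# Each response gets a single scalar, so every reasoning step inside a
# failed response is penalized equally, whether the error came early or late.
print(advantages)
```

With outcome-only rewards, a response that went wrong in its last step and one that was off track from the start receive the same advantage, which is the ambiguity the abstract points to.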
A common solution is to train a process reward model (PRM) for step-level supervision, but this typically requires large-scale high-quality chain-of-thought annotations and additional training cost. We propose ProcessThinker, a practical post-training pipeline that provides step-level process rewards without training an explicit PRM.
ProcessThinker first rewrites reasoning traces into a step-tagged format for cold-start supervised fine-tuning, then applies GRPO with a standard format reward and our rollout-based process reward. Concretely, for each intermediate step, we sample multiple continuations from that step and use the empirical success rate (final-answer verification) as the step reward.
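The rollout-based step reward described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: `rollout_step_rewards`, `continue_from`, `is_correct`, and `num_rollouts` are hypothetical names, and `toy_continue` is a deterministic stand-in for sampling continuations from the policy model.

```python
from typing import Callable, List, Sequence


def rollout_step_rewards(
    steps: Sequence[str],
    continue_from: Callable[[Sequence[str]], str],
    is_correct: Callable[[str], bool],
    num_rollouts: int = 4,
) -> List[float]:
    """Score each step by the empirical success rate of rollouts from it.

    continue_from(prefix) is assumed to sample one continuation from the
    policy given the steps so far and return its final answer.
    is_correct(answer) is assumed to be the final-answer verifier.
    """
    if num_rollouts <= 0:
        raise ValueError("num_rollouts must be positive")
    rewards = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]  # reasoning up to and including step i
        successes = sum(is_correct(continue_from(prefix)) for _ in range(num_rollouts))
        rewards.append(successes / num_rollouts)
    return rewards


def toy_continue(prefix: Sequence[str]) -> str:
    # Hypothetical stand-in for the policy: any flawed step in the prefix
    # derails the final answer.
    return "0" if any("error" in s for s in prefix) else "42"


steps = ["read the chart", "add the values", "arithmetic error here"]
rewards = rollout_step_rewards(steps, toy_continue, lambda a: a == "42")
print(rewards)
```

In this toy run the first two steps keep a high success rate and the third drops to zero, so the penalty lands on the step that introduced the mistake. With a real stochastic policy the rates would be fractional, and the cost grows with the number of steps times `num_rollouts`.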
This gives dense credit assignment and encourages reasoning steps that more reliably support a correct conclusion, helping reduce inconsistent or self-contradictory progress across steps, a key issue in logical reasoning. Across four challenging video benchmarks (Video-MMMU, MMVU, VideoMathQA, and LongVideoBench), ProcessThinker consistently improves over the baseline model Qwen3-VL-8B-Instruct.