ProcessThinker: Enhancing Multi-modal Large Language Models Reasoning via Rollout-based Process Reward

arXiv cs.CL·Jingpei Wu, Xiao Han, Weixiang Shen, Boer Zhang, Zifeng Ding, Volker Tresp

2d ago

·~2 min·6/11/2026·en·0

Quick Answer

ProcessThinker enhances multi-modal reasoning in large language models like Qwen3-VL-8B-Instruct by implementing a rollout-based process reward system.

Quick Take

ProcessThinker enhances multi-modal reasoning in large language models like Qwen3-VL-8B-Instruct by implementing a rollout-based process reward system. This method improves performance across four benchmarks, including Video-MMMU and VideoMathQA, without the need for extensive chain-of-thought annotations. The approach reduces inconsistencies in logical reasoning, leading to more reliable conclusions.

Key Points

ProcessThinker uses a novel rollout-based process reward for step-level supervision.
It rewrites reasoning traces into a step-tagged format for efficient fine-tuning.
The model shows consistent improvement across four challenging video benchmarks.
Empirical success rates from sampled continuations provide dense credit assignment.
This method helps reduce logical inconsistencies and contradictions in reasoning.

Paper Resources

Read Paperarxiv.org View PDFarxiv.org

Article Content

From source RSS / original summary

arXiv:2606. 11209v1 Announce Type: new Abstract: Visual question answering increasingly requires multi-step reasoning. Recent post-training with reinforcement learning under verifiable rewards (RLVR) and Group Relative Policy Optimization (GRPO) can improve multimodal reasoning, but most approaches rely on sparse outcome-only rewards. As a result, they struggle to tell whether an incorrect answer comes from a small mistake late in the reasoning or from an unhelpful trajectory from the start.

A common solution is to train a process reward model (PRM) for step-level supervision, but this typically requires large-scale high-quality chain-of-thought annotations and additional training cost. We propose ProcessThinker, a practical post-training pipeline that provides step-level process rewards without training an explicit PRM.

ProcessThinker first rewrites reasoning traces into a step-tagged format for cold-start supervised fine-tuning, then applies GRPO with a standard format reward and our rollout-based process reward. Concretely, for each intermediate step, we sample multiple continuations from that step and use the empirical success rate (final-answer verification) as the step reward.

This gives dense credit assignment and encourages reasoning steps that more reliably support a correct conclusion, helping reduce inconsistent or self-contradictory progress across steps -- a key issue in logical reasoning. Across four challenging video benchmarks (Video-MMMU, MMVU, VideoMathQA, and LongVideoBench), ProcessThinker consistently improves over the baseline model Qwen3-VL-8B-Instruct

Reader Mode unavailable (could not extract clean content).

Read on arxiv.org

Want this in your inbox every morning?

Daily brief at your local 8am — bilingual EN/中文, free.

Subscribe — it's free

More from arXiv cs.CL

See more →

arXiv cs.CL·Leyao Wang, Yanan He, Peng Chen, Asaf Yehudai, Yixin Liu, Rex Ying, Michal Shmueli-Scheuer, Arman Cohan

3w ago

FeaturedOriginal

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

AI Summary

The REFLECT benchmark reveals that current LLM judges are unreliable, achieving below 55% accuracy in evaluating reasoning and evidence use, highlighting the need for improved evaluation methods for deep research agents.

#LLM #Agent #Inference #Policy