Teaching the Way, Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization
Quick Answer
The PTD-PO framework enhances Large Vision-Language Models (LVLMs) by providing dense guidance without revealing answers, improving multimodal reasoning performance.
Quick Take
Experiments on models with 2B to 8B parameters show that PTD-PO outperforms RLVR and distillation baselines, stabilizes distillation, and mitigates entropy collapse.
Key Points
- PTD-PO uses structured privileged hints for token-distribution supervision.
- It avoids the substantial computational overhead that external-teacher distillation methods introduce.
- The framework stabilizes distillation with a Top-K Jensen-Shannon divergence objective.
- Experiments demonstrate improved performance on complex multimodal reasoning tasks.
- PTD-PO aligns the student's failed rollouts with a hint-augmented reference model at the token-distribution level.
Article Content
From source RSS / original summary: arXiv:2606.07000v1. Abstract: Recent post-training methods, particularly Reinforcement Learning with Verifiable Rewards (RLVR), have significantly enhanced the reasoning ability of Large Vision-Language Models (LVLMs). However, the sparse nature of verifiable rewards provides little token-level supervision for failed rollouts, often leading to inefficient exploration in complex multimodal reasoning tasks.
Although policy distillation can offer dense guidance, external-teacher-based methods introduce substantial computational overhead, while answer-conditioned tuning methods may expose answer-level information and induce shortcut-like generation behavior. To address these limitations, we propose PTD-PO, a Privileged Tutoring Distillation Policy Optimization framework for RLVR that provides dense guidance without exposing the answer to the student policy.
Specifically, PTD-PO constructs structured privileged hints from spatial attention guidance and intermediate textual reasoning steps, and uses them through in-context learning to produce step-wise token-distribution supervision. The student is still optimized under the original answer-free context, and its failed rollouts are aligned with the hint-augmented reference model at the token-distribution level.
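The split between guided and unguided contexts described above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names `build_contexts` and `failed_rollout_distill_loss`, the representation of hints as plain strings, the `success_reward` threshold, and the simple token average are all hypothetical choices that the abstract does not specify.

```python
def build_contexts(question, hints):
    """Return (student_context, reference_context).

    Assumption: hints are plain strings prepended to the question for the
    reference model only. The student never sees them.
    """
    student_context = question
    reference_context = "\n".join(list(hints) + [question])
    return student_context, reference_context


def failed_rollout_distill_loss(rewards, token_divergences, success_reward=1.0):
    """Average per-token divergence over failed rollouts only.

    rewards: one verifiable reward per rollout.
    token_divergences: per rollout, a list of per-token divergences between
        the student (answer-free context) and the reference (hinted context).
    Assumption: a rollout counts as failed when its reward is below
    success_reward; successful rollouts contribute nothing to this term.
    """
    total, count = 0.0, 0
    for reward, divergences in zip(rewards, token_divergences):
        if reward >= success_reward:
            continue  # successful rollout: no distillation term in this sketch
        total += sum(divergences)
        count += len(divergences)
    return total / count if count else 0.0
```

In this sketch the sparse reward only decides which rollouts receive dense supervision; the per-token signal itself comes from comparing the two token distributions.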
To further stabilize distillation under the distribution shift between guided and unguided contexts, we introduce a Top-K Jensen-Shannon divergence objective that focuses alignment on informative token probabilities while reducing memory overhead. Experiments on LVLMs ranging from 2B to 8B parameters show that PTD-PO consistently outperforms RLVR and distillation baselines, mitigates entropy collapse, and improves complex multimodal reasoning performance.
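As a rough illustration of a Top-K Jensen-Shannon objective for a single token position, the sketch below restricts both distributions to a small token subset before comparing them. The abstract does not say how the Top-K support is chosen or how the remaining probability mass is handled, so two assumptions are made here: the support is the reference model's k most probable tokens, and both distributions are renormalized over that support. The name `top_k_jsd` is hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def top_k_jsd(student_logits, reference_logits, k):
    """Jensen-Shannon divergence restricted to the reference's top-k tokens.

    Assumption: support selection and renormalization are illustrative
    choices, not details confirmed by the source.
    """
    p_ref = softmax(reference_logits)
    p_stu = softmax(student_logits)
    # Indices of the k tokens the reference model considers most likely.
    top = sorted(range(len(p_ref)), key=lambda i: p_ref[i], reverse=True)[:k]
    ref_mass = sum(p_ref[i] for i in top)
    stu_mass = sum(p_stu[i] for i in top)
    ref = [p_ref[i] / ref_mass for i in top]
    stu = [p_stu[i] / stu_mass for i in top]
    # JSD(ref, stu) = 0.5 * KL(ref || mix) + 0.5 * KL(stu || mix)
    mix = [0.5 * (r + s) for r, s in zip(ref, stu)]
    return 0.5 * kl(ref, mix) + 0.5 * kl(stu, mix)
```

Because only k entries per position are kept, the tensors held for the loss scale with k rather than with the full vocabulary, which is one plausible reading of the memory saving the abstract mentions. The Jensen-Shannon divergence is also bounded above by log 2, which keeps the loss finite even when the guided and unguided distributions disagree sharply.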