PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models
Quick Answer
PerceptionDLM is a multimodal diffusion language model optimized for efficient parallel region perception: it captions multiple image regions simultaneously instead of one at a time.
Quick Take
PerceptionDLM builds on PerceptionDLM-Base, which the authors report as state of the art among open-source diffusion MLLMs. Efficient prompting and structured attention masking let it caption multiple regions simultaneously, which the authors report substantially speeds up inference compared with processing regions sequentially. On the new ParaDLC-Bench benchmark, it maintains competitive region-captioning quality while improving efficiency on multi-region tasks.
Key Points
- PerceptionDLM is a multimodal diffusion language model optimized for efficient parallel region perception.
- Introduces efficient prompting and structured attention masking so that multiple masked regions are captioned simultaneously.
- Reports substantial speed improvements over methods that process regions sequentially, while keeping region-captioning quality competitive.
- Evaluates both caption quality and inference efficiency on the new Parallel Detailed Localized Captioning Benchmark (ParaDLC-Bench), which extends DLC-Bench with multiple region masks per image.
- According to the authors, the first work to use diffusion language models for parallel region captioning and perception.
Article Content
From source RSS / original summary (arXiv:2606.19534v1)
Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks. However, most existing MLLMs rely on autoregressive generation, which limits their efficiency for perception tasks that require captioning multiple regions. In this work, we propose PerceptionDLM, a multimodal diffusion language model optimized for efficient parallel region perception.
Built upon PerceptionDLM-Base, a strong foundational baseline that achieves state-of-the-art performance among open-source diffusion MLLMs, our architecture fully leverages the parallel decoding nature of DLMs. Specifically, we introduce efficient prompting and structured attention masking to enable simultaneous perception of multiple masked regions, allowing the model to generate region descriptions in parallel at both the sequence and token levels.
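To make the idea of structured attention masking concrete, here is a minimal sketch of one plausible mask layout. It is not the authors' implementation: the token layout (a shared image-and-prompt prefix followed by one caption segment per region), the rule that caption segments cannot see each other, and the function name `build_region_mask` are all assumptions made for illustration. Attention inside a caption segment is bidirectional, as is typical for diffusion language models.

```python
from typing import List


def build_region_mask(num_shared: int, region_lens: List[int]) -> List[List[bool]]:
    """Return mask[q][k] == True when query position q may attend to key position k.

    Hypothetical layout (an assumption, not taken from the paper):
        [shared image + prompt tokens | caption for region 0 | caption for region 1 | ...]
    """
    # Segment id per position: -1 for the shared prefix, r for region r's caption.
    seg = [-1] * num_shared
    for r, n in enumerate(region_lens):
        seg.extend([r] * n)

    total = len(seg)
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if seg[k] == -1:
                # Every position can read the shared image/prompt context.
                mask[q][k] = True
            elif seg[q] == seg[k]:
                # Bidirectional attention within the same caption segment only.
                mask[q][k] = True
    return mask


if __name__ == "__main__":
    # Two shared tokens, then captions of length 2 and 3.
    for row in build_region_mask(2, [2, 3]):
        print("".join("1" if allowed else "0" for allowed in row))
```

Under this layout the caption segments are independent given the shared context, so all of them can be denoised in the same forward passes without one region's description leaking into another's.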
This design significantly improves inference efficiency compared with existing approaches that process regions sequentially. To systematically evaluate the parallelism property of visual perception capability for DLMs, we construct a new Parallel Detailed Localized Captioning Benchmark (ParaDLC-Bench) by scaling the DLC-Bench to include multiple region masks per image, enabling joint evaluation of both caption quality and inference efficiency.
Experiments demonstrate that PerceptionDLM maintains competitive performance in region captioning while achieving substantial speed improvements for multi-region perception tasks. Our results highlight the potential of multimodal diffusion language models for efficient, parallel visual perception. To the best of our knowledge, we are the first to achieve parallel region captioning and perception by leveraging the advantages of diffusion language models. Code, models, and datasets are released.
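The source of the speed-up can be illustrated with a back-of-the-envelope count of model forward passes. This is a simplified cost model under stated assumptions, not a result from the paper: it assumes an autoregressive model spends one pass per generated token and handles regions one after another, while a diffusion model denoises all region captions in the same passes and unmasks a fixed number of tokens per region per pass. The function names and the example numbers are hypothetical.

```python
import math


def sequential_ar_passes(num_regions: int, tokens_per_caption: int) -> int:
    """Forward passes when each region is captioned in turn, one token per pass."""
    return num_regions * tokens_per_caption


def parallel_dlm_passes(tokens_per_caption: int, tokens_per_step: int) -> int:
    """Forward passes when all regions share each denoising pass.

    Assumes every pass unmasks `tokens_per_step` tokens in every region's caption,
    so the count does not depend on the number of regions.
    """
    return math.ceil(tokens_per_caption / tokens_per_step)


if __name__ == "__main__":
    regions, length, per_step = 4, 64, 4  # illustrative values only
    ar = sequential_ar_passes(regions, length)
    dlm = parallel_dlm_passes(length, per_step)
    print(f"sequential AR passes: {ar}, parallel DLM passes: {dlm}")
```

Pass counts are only a proxy for latency: a pass over several concatenated caption segments costs more than a pass over one, so measured wall-clock gains, such as those the benchmark is designed to capture, will generally be smaller than the ratio of pass counts.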