Dex2HOI: Dexterous Bimanual Two-Object Interaction Generation
Quick Take
Dex2HOI introduces a unified diffusion model for generating dexterous bimanual two-object interactions from text, achieving up to 540x speed-up in inference over existing methods. This model utilizes a Dual-Stream Diffusion approach and a Motion Fusion Network, demonstrating state-of-the-art performance on both single- and two-object benchmarks, paving the way for advanced multi-object manipulation.
Key Points
- Dex2HOI employs a Dual-Stream Diffusion approach: each object is processed in a dedicated interaction stream, and the streams are coordinated through bidirectional cross-attention.
- Achieves up to 540x faster inference speed compared to previous state-of-the-art methods.
- Integrates a Motion Fusion Network, with hand-relative object representations and contact-aware conditioning, to synthesize the final motion.
- Demonstrates state-of-the-art results on single- and two-object benchmarks.
- Code and models will be released upon acceptance of the paper.
Article Content
From source RSS / original summary
arXiv:2605.30444v1, Announce Type: new
Abstract: Recent advances in 4D Human-Object Interaction (HOI) generation have enabled increasingly realistic motion synthesis, particularly for single-object manipulation. Yet current research overlooks an inherent property of human behavior: people naturally coordinate both hands and manipulate multiple objects simultaneously. To address this gap, we present Dex2HOI, a unified diffusion model for single- and two-object HOI synthesis from text.
At its core, Dex2HOI employs a Dual-Stream Diffusion approach, where each object is processed in a dedicated interaction stream and coordinated through bidirectional cross-attention. To synthesize the final motion, we introduce a Motion Fusion Network integrated with novel hand-relative object representations and contact-aware conditioning applied across the whole sequence.
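As a rough illustration of the coordination step described above, the following Python sketch lets two token streams (one per object) exchange information through bidirectional cross-attention. It is not the authors' implementation. The function names `cross_attend` and `bidirectional_cross_attention`, the single attention head, the lack of learned projections, and the residual update are all assumptions made to keep the example short.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Scaled dot-product attention: each query token attends over `context`.

    Assumption: queries, keys and values are the raw tokens (no learned
    projection matrices), which a real model would almost certainly have.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in context
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, context)) for j in range(dim)]
        )
    return attended


def bidirectional_cross_attention(stream_a, stream_b):
    """Update both streams at once, each one reading from the other.

    Both directions use the streams as they were before the update, and the
    attended features are added back residually (an assumed design choice).
    """
    a_from_b = cross_attend(stream_a, stream_b)
    b_from_a = cross_attend(stream_b, stream_a)
    new_a = [[x + y for x, y in zip(tok, upd)] for tok, upd in zip(stream_a, a_from_b)]
    new_b = [[x + y for x, y in zip(tok, upd)] for tok, upd in zip(stream_b, b_from_a)]
    return new_a, new_b
```

In this toy form each stream keeps its own sequence length and feature size; only the content of the tokens changes, which is what allows the two object streams to stay separate while still being coordinated.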
By sampling the diffusion process autoregressively over prefix-conditioned windows, Dex2HOI generates arbitrarily long sequences at real-time speed without redundant test-time optimization, achieving up to a 540x inference speed-up over prior state-of-the-art methods. Extensive evaluation on both single- and two-object benchmarks demonstrates state-of-the-art quantitative results, marking a step beyond conventional single-object HOI generation and toward expressive multi-object manipulation.
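The windowed sampling idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not the paper's sampler: frames are single floats, `sample_window` is a hypothetical stand-in for a full diffusion pass (its "denoiser" is just a smoothing step), and the prefix is enforced by overwriting the first frames after every step, which is one common way to condition on known frames but is not confirmed by the abstract.

```python
import random


def sample_window(prefix, window_len, steps=4, rng=None):
    """Hypothetical stand-in for one diffusion sampling pass over a window.

    Starts from Gaussian noise, applies a toy denoising step several times,
    and clamps the first len(prefix) frames to the given prefix after each step.
    """
    if rng is None:
        rng = random.Random(0)
    frames = [rng.gauss(0.0, 1.0) for _ in range(window_len)]
    for _ in range(steps):
        # Toy denoiser: pull each frame halfway toward its predecessor.
        for i in range(1, window_len):
            frames[i] = 0.5 * frames[i] + 0.5 * frames[i - 1]
        # Prefix conditioning: keep the already generated frames fixed.
        frames[:len(prefix)] = prefix
    return frames


def generate_long(total_len, window_len=8, prefix_len=2, seed=0):
    """Generate `total_len` frames window by window.

    Each new window is conditioned on the last `prefix_len` frames produced
    so far, and only its non-prefix frames are appended to the output.
    """
    if not 0 < prefix_len < window_len:
        raise ValueError("prefix_len must satisfy 0 < prefix_len < window_len")
    rng = random.Random(seed)
    motion = []
    while len(motion) < total_len:
        prefix = motion[-prefix_len:] if motion else []
        window = sample_window(prefix, window_len, rng=rng)
        motion.extend(window[len(prefix):])
    return motion[:total_len]
```

Because every window has a fixed length, the cost per window is constant and the total cost grows linearly with sequence length, which is consistent with the claim that arbitrarily long sequences can be produced without a separate optimization stage.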
Code and models will be released upon acceptance.
