Mode-as-Sequence: Translating Multimodal Motion Prediction into Unified Sequential Mode Modeling
Quick Take
Mode-as-Sequence is a decoding framework for multimodal motion prediction that turns an unordered set of modes into an ordered sequence and models mode-to-mode dependency explicitly. Its two instantiations, ModeSeq (recurrent decoding) and Parallel ModeSeq (single-pass decoding with masked self-attention), target mode collapse and unreliable confidence ranking, and each took first place in a Waymo Open Dataset challenge (2024 and 2025 respectively).
Key Points
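- Mode-as-Sequence translates an unordered mode set into an ordered mode sequence and explicitly models mode-to-mode dependency.
- ModeSeq took 1st place in the 2024 Waymo Open Dataset LiDAR-free motion prediction track; Parallel ModeSeq took 1st place in the 2025 Interaction Prediction Challenge.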
- The framework addresses mode collapse, improving trajectory diversity and confidence ranking.
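- Early-Match-Take-All (EMTA), its joint-scene extension MA-EMTA, and a lightweight ranking regularizer are used to learn representative modes and calibrated confidence under sparse labels.
- Experiments on large-scale benchmarks show consistent improvements in ranking-oriented metrics and best-of-K accuracy across datasets, horizons, and object types.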
Article Content
From source RSS / original summary
arXiv:2605.24037v1
Abstract: Multimodal motion forecasting is inherently under-supervised: each training scene provides only one realized future, yet multiple plausible futures exist. This sparse supervision often leads to mode collapse (redundant hypotheses and insufficient mode coverage) and unreliable confidence ranking when predicting a small set of trajectories.
We propose Mode-as-Sequence, a unified decoding framework that translates an unordered mode set into an ordered mode sequence and explicitly models mode-to-mode dependency. Under this framework, we develop two complementary instantiations. ModeSeq performs recurrent mode decoding, where each mode is generated conditioned on the previously generated modes, encouraging diverse, non-redundant hypotheses with calibrated confidence ordering.
To remove the mode-by-mode autoregressive bottleneck, we further propose Parallel ModeSeq, which preserves the same causal dependency using masked mode-to-mode self-attention while decoding all modes in a single forward pass, enabling efficient large-$K$ inference and scalable joint-scene prediction.
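The idea of keeping a causal mode order while decoding everything in one pass can be illustrated with a small sketch. This is not the paper's implementation: it assumes a single attention head with no learned projections (each mode embedding serves as query, key, and value), and the names `causal_mode_mask` and `masked_self_attention` are hypothetical. It only shows that, under a lower-triangular mask, the output for mode k depends on modes 1..k and not on later ones.

```python
import math


def causal_mode_mask(num_modes):
    """mask[i][j] is True when mode i may attend to mode j, i.e. j <= i."""
    return [[j <= i for j in range(num_modes)] for i in range(num_modes)]


def masked_self_attention(mode_embeddings):
    """Toy single-head self-attention over K mode embeddings.

    Assumption for illustration: the embeddings are used directly as
    queries, keys, and values; a real decoder would use learned projections.
    """
    num_modes = len(mode_embeddings)
    dim = len(mode_embeddings[0])
    mask = causal_mode_mask(num_modes)
    outputs = []
    for i in range(num_modes):
        # Only modes at or before position i are visible to mode i.
        allowed = [j for j in range(num_modes) if mask[i][j]]
        scores = [
            sum(q * k for q, k in zip(mode_embeddings[i], mode_embeddings[j]))
            / math.sqrt(dim)
            for j in allowed
        ]
        # Numerically stable softmax over the visible modes.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out = [0.0] * dim
        for j, w in zip(allowed, weights):
            for t in range(dim):
                out[t] += (w / total) * mode_embeddings[j][t]
        outputs.append(out)
    return outputs


modes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perturbed = [[1.0, 0.0], [0.0, 1.0], [5.0, -3.0]]  # only the last mode changes
base_out = masked_self_attention(modes)
perturbed_out = masked_self_attention(perturbed)
print(base_out[:2] == perturbed_out[:2])  # earlier modes are unaffected
```

All K outputs are computed from one call, yet changing the last mode leaves the outputs of the earlier modes untouched, which is the dependency structure a recurrent decoder would otherwise enforce step by step.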
To learn representative modes and calibrated confidence under sparse labels, we introduce Early-Match-Take-All (EMTA) and its joint-scene extension MA-EMTA, together with a lightweight ranking regularizer that reduces confidence inversions. Extensive experiments on large-scale benchmarks demonstrate consistent improvements in both ranking-oriented metrics and best-of-K accuracy across datasets, horizons, and object types.
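For readers unfamiliar with best-of-K evaluation, the sketch below computes minimum average displacement error (minADE), one widely used best-of-K metric. Whether the paper reports exactly this metric is an assumption; the function names and the toy trajectories are hypothetical.

```python
import math


def average_displacement_error(predicted, ground_truth):
    """Mean Euclidean distance between matching waypoints of two trajectories."""
    return sum(math.dist(p, g) for p, g in zip(predicted, ground_truth)) / len(ground_truth)


def min_ade(predicted_modes, ground_truth):
    """Return (lowest ADE over the K modes, index of the mode that achieves it)."""
    errors = [average_displacement_error(mode, ground_truth) for mode in predicted_modes]
    best_index = min(range(len(errors)), key=errors.__getitem__)
    return errors[best_index], best_index


# Toy example: the realized future goes straight; the model predicts K = 2 modes.
ground_truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
veer_left = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]

error, index = min_ade([veer_left, straight], ground_truth)
print(error, index)
```

A best-of-K score like this ignores the order and confidence of the modes: the result is the same whether the matching hypothesis is listed first or last. That is why ranking-oriented metrics are reported alongside it.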
In the Waymo Open Dataset challenges, ModeSeq achieves 1st place in the 2024 LiDAR-free motion prediction track, and Parallel ModeSeq achieves 1st place in the 2025 Interaction Prediction Challenge, validating the effectiveness of Mode-as-Sequence for both accuracy and efficiency.
