Hand Trajectory Fusion for Egocentric Natural Language Query Grounding
Quick Take
The proposed hand-trajectory encoder improves Egocentric Natural Language Query (NLQ) grounding by fusing hand-motion features with pretrained video-text features. On the Ego4D NLQ v2 validation split, the clearest gains are on Hand-Object Interaction queries (+2.54 R1@IoU=0.3) and Quantity/State queries (+4.32 R1@IoU=0.3), which suggests that hand motion carries grounding cues beyond visual appearance.
Key Points
- Introduces a hand-trajectory encoder for improved NLQ grounding.
- Achieves +2.54 R1@IoU=0.3 on Hand-Object Interaction queries.
- Records +4.32 R1@IoU=0.3 for Quantity/State queries.
- Indicates that hand trajectories provide grounding cues beyond appearance alone.
- Utilizes cross-attention fusion with adaptive gating.
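For readers unfamiliar with the R1@IoU=0.3 metric quoted above, here is a minimal sketch of how it is commonly computed: a query counts as a hit when the model's top-ranked interval overlaps the ground-truth interval with temporal IoU of at least 0.3. The function names, the toy intervals, and the `>=` comparison are illustrative assumptions, not the official Ego4D evaluation code, and the sketch assumes scores are reported as percentages.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds, gts, iou_threshold=0.3):
    """Percentage of queries whose top-1 interval reaches the IoU threshold."""
    hits = sum(
        temporal_iou(pred, gt) >= iou_threshold
        for pred, gt in zip(top1_preds, gts)
    )
    return 100.0 * hits / len(gts)


# Toy data (made up for illustration): one top-1 prediction per query.
preds = [(10.0, 20.0), (5.0, 8.0), (30.0, 40.0), (0.0, 2.0)]
gts = [(12.0, 22.0), (20.0, 25.0), (30.0, 35.0), (1.0, 10.0)]

print(recall_at_1(preds, gts))  # 50.0 (queries 1 and 3 pass the 0.3 threshold)
```

Under this reading, a delta such as +2.54 would mean that 2.54 percentage points more queries in that category are localized correctly by the top-1 prediction.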
Article Excerpt
From source RSS / original summary
arXiv:2606.02962v1 Announce Type: new
Abstract: Egocentric Natural Language Query (NLQ) grounding asks a model to localize, in a long first-person video, the temporal interval that answers a free-form text query. Existing methods fuse video appearance with the query but ignore hand motion, despite the fact that roughly 41% of Ego4D NLQ queries are answered at a moment of hand-object manipulation or their immediate outcomes.
We propose a hand-trajectory encoder for converting a sequence of hand skeletons into highly-semantic hand kinematic features, which are then aligned and combined with pretrained video-text features through a cross-attention fusion strategy with adaptive gating. On the Ego4D NLQ v2 validation split, the clearest gains appear for Hand-Object Interaction queries (+2.54 R1@IoU=0.3) and Quantity/State queries (+4.32 R1@IoU=0.3), indicating that hand trajectory provides grounding cues beyond appearance alone.
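The abstract does not spell out the fusion layer, so the following is only a sketch of one common way to build cross-attention fusion with a gate, not the paper's implementation. It assumes video features act as queries, hand features act as keys and values, and a single sigmoid gate per time step scales the attended hand signal before a residual add. The names `gated_cross_attention`, `gate_w`, and `gate_b` are hypothetical, and learned projection matrices are omitted for brevity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    # Branch on the sign so math.exp never overflows.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_cross_attention(video_feats, hand_feats, gate_w, gate_b):
    """Fuse hand features into video features (illustrative assumption).

    video_feats: list of T_v vectors of size d, used as attention queries.
    hand_feats:  list of T_h vectors of size d, used as keys and values.
    gate_w:      hypothetical gate weights of size 2 * d.
    gate_b:      hypothetical scalar gate bias.
    """
    d = len(video_feats[0])
    scale = 1.0 / math.sqrt(d)
    fused = []
    for v in video_feats:
        # Scaled dot-product attention of one video step over all hand steps.
        weights = softmax([dot(v, h) * scale for h in hand_feats])
        attended = [
            sum(w * h[i] for w, h in zip(weights, hand_feats))
            for i in range(d)
        ]
        # The gate sees the video feature concatenated with the attended
        # hand feature and decides how much hand signal to admit.
        gate = sigmoid(dot(gate_w, list(v) + attended) + gate_b)
        fused.append([vi + gate * ai for vi, ai in zip(v, attended)])
    return fused


# Toy inputs (made up): two video steps, three hand steps, d = 2.
video = [[1.0, 0.0], [0.0, 1.0]]
hands = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
fused = gated_cross_attention(video, hands, gate_w=[0.1, 0.1, 0.1, 0.1], gate_b=0.0)
```

The appeal of a gate in this kind of design is that the model can fall back to the video-text features when the hand signal is uninformative: with a strongly negative gate bias the output reduces to the original video features.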