Hand Trajectory Fusion for Egocentric Natural Language Query Grounding

arXiv cs.CV·Enmin Zhong, Carlos R. del-Blanco, Fernando Jaureguizar, Narciso García

6/3/2026

Quick Take

The proposed hand-trajectory encoder adds hand motion to pretrained video-text features for Egocentric Natural Language Query (NLQ) grounding. On the Ego4D NLQ v2 validation split, the clearest gains appear for Hand-Object Interaction queries (+2.54 R1@IoU=0.3) and Quantity/State queries (+4.32 R1@IoU=0.3). The results indicate that hand trajectories provide grounding cues beyond visual appearance alone.

Key Points

Introduces a hand-trajectory encoder for improved NLQ grounding.
Achieves +2.54 R1@IoU=0.3 on Hand-Object Interaction queries.
Records +4.32 R1@IoU=0.3 for Quantity/State queries.
Shows that hand motion provides grounding cues beyond visual appearance.
Uses cross-attention fusion with adaptive gating.
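
The R1@IoU=0.3 figures above can be read as follows: a query counts as a hit when the model's top-ranked interval overlaps the ground-truth interval with a temporal IoU of at least 0.3, and R1 is the percentage of queries that are hits. The sketch below illustrates that reading in plain Python. It is not the official Ego4D evaluation code; the function names, the (start, end) tuple layout, and the toy intervals are assumptions made for illustration.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) intervals given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds, gts, iou_threshold=0.3):
    """Percentage of queries whose top-1 interval reaches the IoU threshold."""
    hits = sum(
        temporal_iou(pred, gt) >= iou_threshold
        for pred, gt in zip(top1_preds, gts)
    )
    return 100.0 * hits / len(gts)


# Toy data (hypothetical): one top-1 prediction and one ground truth per query.
preds = [(10.0, 20.0), (5.0, 6.0), (40.0, 50.0)]
gts = [(12.0, 22.0), (30.0, 35.0), (44.0, 58.0)]

score = recall_at_1(preds, gts, iou_threshold=0.3)
print(f"R1@IoU=0.3: {score:.2f}")
```

Under this reading, a reported gain such as +2.54 would be a difference between two such percentages, one for the baseline and one for the model with hand-trajectory features.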

Article Excerpt

From source RSS / original summary

arXiv:2606.02962v1. Abstract: Egocentric Natural Language Query (NLQ) grounding asks a model to localize, in a long first-person video, the temporal interval that answers a free-form text query. Existing methods fuse video appearance with the query but ignore hand motion, even though roughly 41% of Ego4D NLQ queries are answered at a moment of hand-object manipulation or its immediate outcome.

We propose a hand-trajectory encoder for converting a sequence of hand skeletons into highly semantic hand kinematic features, which are then aligned and combined with pretrained video-text features through a cross-attention fusion strategy with adaptive gating. On the Ego4D NLQ v2 validation split, the clearest gains appear for Hand-Object Interaction queries (+2.54 R1@IoU=0.3) and Quantity/State queries (+4.32 R1@IoU=0.3), indicating that hand trajectory provides grounding cues beyond appearance alone.
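
The abstract does not spell out the fusion layer, so the following is only a minimal sketch of one common way to realize "cross-attention fusion with adaptive gating", written with NumPy. Every detail here is an assumption rather than the authors' design: the single attention head, the sigmoid gate computed from the concatenated video and hand context, the residual form, and all names (gated_cross_attention, Wq, Wk, Wv, Wg, bg) and dimensions. Video-text features act as queries, and hand kinematic features act as keys and values.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gated_cross_attention(video, hand, params):
    """Fuse hand features into video-text features (hypothetical layout).

    video: (T, d) pretrained video-text features, used as queries.
    hand:  (T_h, d) hand kinematic features, used as keys and values.
    Returns the fused (T, d) features and the (T, d) gate values.
    """
    q = video @ params["Wq"]
    k = hand @ params["Wk"]
    v = hand @ params["Wv"]

    # Each video step attends over all hand-trajectory steps.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (T, T_h)
    hand_ctx = attn @ v                                      # (T, d)

    # Assumed gate: a sigmoid over [video; hand context] decides, per step
    # and per channel, how much hand information to let through.
    gate_in = np.concatenate([video, hand_ctx], axis=-1)     # (T, 2d)
    gate = 1.0 / (1.0 + np.exp(-(gate_in @ params["Wg"] + params["bg"])))

    # Residual form: a closed gate leaves the video-text features unchanged.
    return video + gate * hand_ctx, gate


rng = np.random.default_rng(0)
T, T_h, d = 8, 8, 16  # toy sizes, not taken from the paper
params = {
    "Wq": rng.normal(scale=d ** -0.5, size=(d, d)),
    "Wk": rng.normal(scale=d ** -0.5, size=(d, d)),
    "Wv": rng.normal(scale=d ** -0.5, size=(d, d)),
    "Wg": rng.normal(scale=(2 * d) ** -0.5, size=(2 * d, d)),
    "bg": np.zeros(d),
}
video = rng.normal(size=(T, d))
hand = rng.normal(size=(T_h, d))

fused, gate = gated_cross_attention(video, hand, params)
print(fused.shape, float(gate.min()), float(gate.max()))
```

The residual-plus-gate form is chosen here because it lets the layer fall back to the pretrained video-text features when hand motion is uninformative for a query, which is one plausible motivation for gating; the weights are random and untrained, so the output values carry no meaning beyond their shapes.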
