Beyond Motion Primitives: Behavioral Activity Recognition from Head-Mounted IMU
Quick Take
The study moves head-mounted IMU sensing beyond motion primitives such as walking or standing to behavioral-level recognition, introducing a new dataset and a 703K-parameter hierarchical model for AR smart glasses.
Key Points
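- Defines five behavioral categories that balance AR application need with sensor observability.
- Constructs a 160K-sample Ego4D dataset spanning 8 activity scenarios, with a four-tier quality assurance framework.
- Proposes HiT-HAR, a hierarchical model that outperforms prior head-mounted IMU models on five-class action and eight-class scenario recognition.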
Article Content
From source RSS / original summary (arXiv:2605.27464v1, announce type: new)
Abstract: AR smart glasses need continuous behavioral context to offer proactive assistance, yet their most practical always-on sensor, the head-mounted Inertial Measurement Unit (IMU), detects only motion primitives such as walking or standing. We push beyond motion primitives to behavioral-level recognition, defining five categories that balance AR application need with sensor observability.
To this end, we construct a 160K-sample Ego4D dataset with a four-tier quality assurance framework spanning 8 activity scenarios, and propose HiT-HAR, a 703K-parameter hierarchical model that outperforms prior head-mounted IMU models on five-class action and eight-class scenario recognition.
We further map the observability frontier of head-mounted IMU through per-class separability analysis, identifying which behavioral categories are reliably observable (Locomotion), which benefit from temporal context (Object Transfer, Task Operation), and where scenario-dependent signal overlap poses remaining challenges. Our results indicate that architectural choices exploiting temporal context and scenario structure outperform simply scaling model size.
The code and dataset are publicly available at https://github.com/Harvard-AI-and-Robotics-Lab/HiT-HAR.
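The abstract does not say which metric the per-class separability analysis uses. As a rough illustration only, the sketch below scores each class with a Fisher-style ratio: squared distance to the nearest other class centroid divided by the class's own within-class spread. The function name `per_class_separability`, the toy 2-D features, and the choice of ratio are all assumptions made here, not the authors' method; the class names are borrowed from the abstract only to label the toy data.

```python
from collections import defaultdict


def per_class_separability(features, labels, eps=1e-12):
    """Hypothetical helper: return {label: score}, higher = easier to separate.

    score = squared distance to the nearest other class centroid
            / mean squared distance of the class's samples to its own centroid
    Requires at least two distinct labels.
    """
    groups = defaultdict(list)
    for vec, lab in zip(features, labels):
        groups[lab].append(vec)

    def centroid(vecs):
        n = len(vecs)
        return [sum(col) / n for col in zip(*vecs)]

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centroids = {lab: centroid(vecs) for lab, vecs in groups.items()}
    scores = {}
    for lab, vecs in groups.items():
        # Within-class spread around the class's own centroid.
        within = sum(sqdist(v, centroids[lab]) for v in vecs) / len(vecs)
        # Gap to the closest competing class centroid.
        between = min(
            sqdist(centroids[lab], c)
            for other, c in centroids.items()
            if other != lab
        )
        scores[lab] = between / (within + eps)
    return scores


# Toy 2-D features (made-up numbers, not taken from the dataset).
features = [
    [0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2],  # locomotion
    [5.0, 5.0], [6.0, 5.0], [5.0, 6.0], [6.0, 6.0],  # task_operation
    [6.0, 6.0], [7.0, 6.0], [6.0, 7.0], [7.0, 7.0],  # object_transfer
]
labels = ["locomotion"] * 4 + ["task_operation"] * 4 + ["object_transfer"] * 4
scores = per_class_separability(features, labels)
```

In this toy setup the tightly clustered, far-away class gets a much higher score than the two overlapping classes, which is the kind of contrast a separability analysis is meant to surface. Real IMU features would be higher-dimensional, and a different metric may suit them better.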
