WristCompass: Kinematic Coupling as a Learnable Visual Concept for Ego-Camera Orientation
Quick Take
WristCompass recovers ego-camera orientation from kinematic coupling dynamics, the physical relationship between wrist motion and camera orientation imposed by the arm-shoulder-head chain, rather than from scene geometry, which fails when hands occlude the frame. Trained only on tabletop manipulation, it transfers zero-shot to Epic Kitchens with a median geodesic error of 14.3°, approaching the performance of a 1B-parameter scene model with only 200K GRU parameters.
Key Points
- WristCompass leverages kinematic coupling for ego-camera orientation recovery.
- VGGT, a 1B-parameter scene reconstruction model, scores worse than a constant predictor on the TACO benchmark when hands occlude the frame.
- Achieves 14.3° median geodesic error on Epic Kitchens cooking video.
- Uses 4D inter-wrist features, which outperform 126D full hand keypoints.
- Transfers zero-shot across datasets because the cue is rooted in anatomy rather than scene appearance.
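For readers unfamiliar with the metric, here is a minimal sketch of how a median geodesic error and a constant-predictor baseline can be computed. It is not the paper's evaluation code: the quaternion convention (w, x, y, z), the helper names, and the toy yaw-only data are all assumptions made for illustration.

```python
import math
from statistics import median

def normalize(q):
    """Scale a quaternion (w, x, y, z) to unit length."""
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)

def geodesic_deg(q_a, q_b):
    """Geodesic angle in degrees between two rotations given as quaternions.

    The absolute value handles the fact that q and -q encode the same rotation.
    """
    dot = abs(sum(a * b for a, b in zip(normalize(q_a), normalize(q_b))))
    dot = min(1.0, dot)  # guard against rounding slightly above 1
    return math.degrees(2.0 * math.acos(dot))

def yaw_quat(deg):
    """Hypothetical helper: rotation by `deg` degrees about the z axis."""
    half = math.radians(deg) / 2.0
    return (math.cos(half), 0.0, 0.0, math.sin(half))

def median_geodesic_error(preds, targets):
    """Median over frames of the per-frame geodesic error."""
    return median(geodesic_deg(p, t) for p, t in zip(preds, targets))

# Toy ground truth: the camera yaws from 0 to 40 degrees over five frames.
targets = [yaw_quat(d) for d in (0, 10, 20, 30, 40)]

# A constant predictor outputs the same orientation for every frame.
constant = [yaw_quat(20)] * len(targets)
baseline_error = median_geodesic_error(constant, targets)
print(f"constant-predictor median error: {baseline_error:.1f} degrees")
```

A model that scores worse than such a baseline is, on that benchmark, less useful than always guessing one fixed orientation.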
Article Content
From source RSS / original summary
arXiv:2605.30671v1 Announce Type: new
Abstract: Recovering ego-camera orientation from manipulation video is a prerequisite for disentangling hand motion from camera motion, a key step in imitation learning from egocentric demonstrations. The obvious approach, inferring orientation from scene geometry, fails when hands occlude the frame: VGGT, a 1B-parameter scene reconstruction model, scores worse than a constant predictor on the TACO benchmark.
We identify an alternative visual concept that is present precisely when scene geometry is absent: kinematic coupling dynamics, the structured physical relationship between wrist motion and camera orientation imposed by the arm-shoulder-head chain.
We find that this concept is compact (4D inter-wrist features outperform 126D full hand keypoints), temporal (requiring a GRU over short windows rather than per-frame retrieval), and physically grounded (transferring zero-shot across datasets because it is rooted in anatomy rather than scene appearance). Trained only on tabletop manipulation, WristCompass transfers zero-shot to Epic Kitchens cooking video, achieving 14.3° median geodesic error and approaching the performance of a 1B-parameter scene model at 200K GRU parameters.
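To make the "GRU over short windows" idea concrete, the following is a rough sketch of a single-layer GRU that reads a window of 4D feature vectors and emits a unit quaternion. It is an illustration only, not the WristCompass architecture: the abstract does not say what the four inter-wrist features are, and the hidden size, window length, quaternion output head, random weights, and the `TinyGRU` name are all assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyGRU:
    """Single-layer GRU plus a linear head. All sizes are illustrative assumptions."""

    def __init__(self, input_dim=4, hidden_dim=64, out_dim=4, seed=0):
        rng = np.random.default_rng(seed)
        scale = 0.1
        # Index 0: update gate, 1: reset gate, 2: candidate state.
        self.W = rng.normal(0.0, scale, (3, hidden_dim, input_dim))
        self.U = rng.normal(0.0, scale, (3, hidden_dim, hidden_dim))
        self.b = np.zeros((3, hidden_dim))
        # Hypothetical output head mapping the last hidden state to a quaternion.
        self.W_out = rng.normal(0.0, scale, (out_dim, hidden_dim))
        self.b_out = np.zeros(out_dim)
        self.hidden_dim = hidden_dim

    def step(self, x, h):
        """One GRU update for a single 4D feature vector x."""
        z = sigmoid(self.W[0] @ x + self.U[0] @ h + self.b[0])
        r = sigmoid(self.W[1] @ x + self.U[1] @ h + self.b[1])
        n = np.tanh(self.W[2] @ x + self.U[2] @ (r * h) + self.b[2])
        return (1.0 - z) * n + z * h

    def predict(self, window):
        """Run the GRU over a (T, 4) window and return a unit quaternion."""
        h = np.zeros(self.hidden_dim)
        for x in window:
            h = self.step(x, h)
        q = self.W_out @ h + self.b_out
        return q / (np.linalg.norm(q) + 1e-12)

    def gru_param_count(self):
        """Parameters in the recurrent part only (head excluded)."""
        return self.W.size + self.U.size + self.b.size

# Stand-in data: 16 frames of 4D features (random here, not real wrist data).
window = np.random.default_rng(1).normal(size=(16, 4))
model = TinyGRU()
q = model.predict(window)
print(q, model.gru_param_count())
```

Because the hidden state is carried across frames, the prediction depends on how the features evolve over the window, which is the contrast the abstract draws with per-frame retrieval.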
