Scale-Gest: Scalable Model-Space Synthesis and Runtime Selection for On-Device Gesture Detection
Quick Take
Scale-Gest is a scalable framework for adaptive on-device gesture detection that jointly optimizes energy consumption and detection performance.
Key Points
- Utilizes a family of tiny-YOLO architectures.
- Introduces ACE profiles for device calibration.
- Reduces energy consumption by 4x while maintaining detection performance.
Abstract: Realizing on-device ML-based gesture detection under tight real-time performance, energy, and memory constraints is challenging, especially on mobile devices with varying battery levels. Existing EdgeAI deployments typically rely on a single fixed detector, limiting optimization opportunities. We present Scale-Gest, a novel run-time adaptive gesture detection framework that expands the detector space into a dense family of tiny-YOLO architectures. We introduce multiple novel device-calibrated ACE (Accuracy-Complexity-Energy) profiles by analyzing different model-resolution-stride operating points. A lightweight run-time controller selects an appropriate ACE mode under user-defined and battery constraints, while a motion-aware hand-gesture-tracking ROI gate crops the input for reduced-complexity detection. To evaluate the performance of our system in real-world car-driving scenarios, we introduce a temporally annotated Driver Simulated Gesture (DSG-18) dataset. Scale-Gest maintains event-level F1 while significantly reducing energy and latency compared to single-detector approaches. On a battery-powered laptop running gesture streams, our ACE controller reduces per-frame energy by 4x (from 6.9 mJ to 1.6 mJ) while maintaining high gesture-detection performance (event-level F1 = 0.8-0.9) and low mean latency (6 ms).
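The abstract only sketches how the run-time controller picks an ACE mode. As an illustration, the selection logic can be imagined as choosing, from a set of device-profiled (model, resolution, stride) operating points, the most accurate one whose per-frame energy fits the current budget. The mode names, accuracy/energy numbers, and linear battery-scaled budget below are all hypothetical assumptions for illustration, not the paper's actual policy:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AceMode:
    """One device-calibrated ACE operating point: a (model, resolution,
    stride) combination with its profiled accuracy and energy cost."""
    name: str
    accuracy: float    # profiled event-level F1 (illustrative values)
    energy_mj: float   # profiled per-frame energy in millijoules

def select_ace_mode(modes, battery_frac, full_budget_mj=6.9):
    """Pick the most accurate mode whose per-frame energy fits the budget.
    Here the budget shrinks linearly with remaining battery fraction -- an
    assumed policy, not necessarily the one used in Scale-Gest."""
    budget = full_budget_mj * battery_frac
    feasible = [m for m in modes if m.energy_mj <= budget]
    if not feasible:
        # Nothing fits: degrade gracefully to the cheapest mode.
        return min(modes, key=lambda m: m.energy_mj)
    return max(feasible, key=lambda m: m.accuracy)

# Hypothetical profiled operating points (numbers illustrative only).
MODES = [
    AceMode("tiny-yolo-640-s1", accuracy=0.90, energy_mj=6.9),
    AceMode("tiny-yolo-416-s2", accuracy=0.87, energy_mj=3.2),
    AceMode("tiny-yolo-256-s4", accuracy=0.82, energy_mj=1.6),
]

print(select_ace_mode(MODES, battery_frac=1.0).name)   # -> tiny-yolo-640-s1
print(select_ace_mode(MODES, battery_frac=0.25).name)  # -> tiny-yolo-256-s4
```

The point of the sketch is that selection is a cheap table lookup over pre-profiled points, so the controller itself adds negligible runtime cost.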
| Comments: | 7 pages, 11 figures, Accepted to DAC 2026 |
| Subjects: | Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Robotics (cs.RO); Image and Video Processing (eess.IV) |
| ACM classes: | I.2.10 |
| Cite as: | arXiv:2605.12506 [cs.CV] |
| (or arXiv:2605.12506v1 [cs.CV] for this version) | |
| https://doi.org/10.48550/arXiv.2605.12506 arXiv-issued DOI via DataCite |
Submission history
From: Abdul Basit [view email]
[v1]
Mon, 16 Mar 2026 10:12:26 UTC (14,333 KB)
— Originally published at arxiv.org