Scale-Gest: Scalable Model-Space Synthesis and Runtime Selection for On-Device Gesture Detection
Quick Take
Scale-Gest is a scalable framework for adaptive on-device gesture detection that jointly optimizes energy consumption and detection performance.
Key Points
- Utilizes a family of tiny-YOLO architectures.
- Introduces ACE profiles for device calibration.
- Reduces energy consumption by 4x while maintaining detection performance.
Abstract: Realizing on-device ML-based gesture detection under tight real-time performance, energy, and memory constraints is challenging, especially on mobile devices with varying battery levels. Existing EdgeAI deployments typically rely on a single fixed detector, limiting optimization opportunities. We present Scale-Gest, a novel run-time adaptive gesture detection framework that expands the detector space into a dense family of tiny-YOLO architectures. We introduce multiple novel device-calibrated ACE (Accuracy-Complexity-Energy) profiles by analyzing different model-resolution-stride operating points. A lightweight run-time controller selects an appropriate ACE mode under user-defined and battery constraints, while a motion-aware hand-gesture-tracking ROI gate crops the input for reduced-complexity detection. To evaluate the performance of our system in real-world car-driving scenarios, we introduce a temporally annotated Driver Simulated Gesture (DSG-18) dataset. Scale-Gest maintains event-level F1 while significantly reducing energy and latency compared to single-detector approaches. On a battery-powered laptop running gesture streams, our ACE controller reduces per-frame energy by 4x (from 6.9 mJ to 1.6 mJ) while maintaining high gesture-detection performance (event-level F1 = 0.8-0.9) and low mean latency (6 ms).
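The abstract only sketches how the run-time controller picks an ACE mode. As an illustration, the selection logic can be imagined as choosing, from a set of device-profiled (model, resolution, stride) operating points, the most accurate one whose per-frame energy fits the current budget. The mode names, accuracy/energy numbers, and linear battery-scaled budget below are all hypothetical assumptions for illustration, not the paper's actual policy:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AceMode:
    """One device-calibrated ACE operating point: a (model, resolution,
    stride) combination with its profiled accuracy and energy cost."""
    name: str
    accuracy: float    # profiled event-level F1 (illustrative values)
    energy_mj: float   # profiled per-frame energy in millijoules

def select_ace_mode(modes, battery_frac, full_budget_mj=6.9):
    """Pick the most accurate mode whose per-frame energy fits the budget.
    Here the budget shrinks linearly with remaining battery fraction -- an
    assumed policy, not necessarily the one used in Scale-Gest."""
    budget = full_budget_mj * battery_frac
    feasible = [m for m in modes if m.energy_mj <= budget]
    if not feasible:
        # Nothing fits: degrade gracefully to the cheapest mode.
        return min(modes, key=lambda m: m.energy_mj)
    return max(feasible, key=lambda m: m.accuracy)

# Hypothetical profiled operating points (numbers illustrative only).
MODES = [
    AceMode("tiny-yolo-640-s1", accuracy=0.90, energy_mj=6.9),
    AceMode("tiny-yolo-416-s2", accuracy=0.87, energy_mj=3.2),
    AceMode("tiny-yolo-256-s4", accuracy=0.82, energy_mj=1.6),
]

print(select_ace_mode(MODES, battery_frac=1.0).name)   # -> tiny-yolo-640-s1
print(select_ace_mode(MODES, battery_frac=0.25).name)  # -> tiny-yolo-256-s4
```

The point of the sketch is that selection is a cheap table lookup over pre-profiled points, so the controller itself adds negligible runtime cost.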
| Comments: | 7 pages, 11 figures, Accepted to DAC 2026 |
| Subjects: | Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Robotics (cs.RO); Image and Video Processing (eess.IV) |
| ACM classes: | I.2.10 |
| Cite as: | arXiv:2605.12506 [cs.CV] |
| (or arXiv:2605.12506v1 [cs.CV] for this version) | |
| https://doi.org/10.48550/arXiv.2605.12506 arXiv-issued DOI via DataCite |
Submission history
From: Abdul Basit [view email]
[v1]
Mon, 16 Mar 2026 10:12:26 UTC (14,333 KB)
— Originally published at arxiv.org