Energy-Efficient On-Device RAG on a Mobile NPU: System Design and Benchmark on Snapdragon X Elite
Quick Answer
This paper shows that the Snapdragon X Elite's Hexagon NPU enables an energy-efficient, end-to-end Retrieval-Augmented Generation (RAG) pipeline, achieving 9.1x higher embedding throughput and 12.3x lower system energy than the CPU during indexing.
Quick Take
The Snapdragon X Elite's Hexagon NPU runs every neural stage of a Retrieval-Augmented Generation (RAG) pipeline (embedding, reranking, and LLM generation), achieving 9.1x higher embedding throughput and 12.3x lower system energy than the CPU during indexing. On a 120-query Wikipedia-passage benchmark, it delivers 18.1x faster LLM prefilling and 4.0x lower end-to-end query latency than the CPU baseline, with answer quality on par with CPU and GPU. The authors expect the result to generalize to comparable mobile NPUs as their software stacks mature.
Key Points
- NPU achieves 9.1x higher embedding throughput than CPU.
- Delivers 18.1x faster LLM prefilling on a 120-query benchmark.
- Uses 4.0x less system energy compared to CPU for end-to-end queries.
- Maintains answer quality on par with CPU and GPU in a GPT-4.1 LLM-as-judge evaluation.
- Sets a precedent for energy-efficient mobile AI solutions.
Article Content
From source RSS / original summary
arXiv:2606.11257v1
Abstract: Retrieval-Augmented Generation (RAG) pipelines are compute-intensive, combining embedding, retrieval, reranking, and large language model (LLM) generation. Running them entirely on-device benefits privacy, latency, and offline use, but the energy cost of CPU inference is a major barrier.
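To make the four stages named above concrete, here is a minimal toy sketch in Python. It is not the paper's implementation: every function is a hypothetical placeholder, with bag-of-words counts standing in for a neural embedding model, shared-term counting standing in for a neural reranker, and prompt assembly standing in for LLM generation.

```python
import math
from collections import Counter

# Toy stand-ins for the RAG stages. All names here are hypothetical and
# none of this code reflects the NPU pipeline described in the paper.

def embed(text: str) -> Counter:
    """Stage 1 (embedding): bag-of-words counts instead of a neural encoder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Stage 2 (retrieval): keep the k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

def rerank(query: str, candidates: list[str]) -> list[str]:
    """Stage 3 (reranking): reorder candidates by number of shared terms."""
    q_terms = set(query.lower().split())
    return sorted(
        candidates,
        key=lambda p: len(q_terms & set(p.lower().split())),
        reverse=True,
    )

def generate(query: str, context: list[str]) -> str:
    """Stage 4 (generation): build the prompt an LLM would see; no model runs."""
    return "Context:\n" + "\n".join(context) + f"\nQuestion: {query}\nAnswer:"

def answer(query: str, passages: list[str]) -> str:
    """Chain the stages: retrieve, rerank, then generate."""
    return generate(query, rerank(query, retrieve(query, passages)))

if __name__ == "__main__":
    corpus = [
        "the hexagon npu runs neural networks",
        "bananas are yellow fruit",
        "the cpu runs general code",
    ]
    print(answer("what runs on the npu", corpus))
```

In a real system, stages 1, 3, and 4 would each call a neural model, which is why they dominate compute and energy cost and are the stages the paper moves to the NPU.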
We present what is, to our knowledge, the first end-to-end RAG pipeline that runs all neural stages -- embedding, reranking, and LLM generation -- on the Qualcomm Hexagon NPU of the Snapdragon X Elite. Profiling on a Dell XPS 13 laptop, we compare NPU-accelerated RAG against CPU and OpenCL/Adreno GPU baselines on indexing and query workloads. On indexing, the NPU achieves 9.1x higher embedding throughput and 12.3x less system energy.
On a 120-query Wikipedia-passage benchmark, it delivers 18.1x faster LLM prefilling, 4.0x lower end-to-end query latency, and 4.0x less system energy than the CPU baseline; the same workload on the integrated GPU is 1.7x slower than CPU and uses 6.5x more energy than the NPU. A GPT-4.1 LLM-as-judge evaluation finds NPU answer quality on par with CPU and GPU within evaluator noise (mean 9.32 vs. 8.95 vs. 9.03 on a 1-10 rubric), with 86.7% of queries scoring identically across all three backends.
On the Snapdragon X Elite / Hexagon class of laptop SoC, the NPU thus enables practical, energy-efficient on-device RAG without quality regression -- a sustainable path toward green edge intelligence that we expect to generalize to comparable mobile NPUs (Apple Neural Engine, Intel NPU, MediaTek APU) as their software stacks mature.
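The abstract reports its query results against mixed baselines (NPU vs. CPU, GPU vs. CPU, GPU vs. NPU), so it can help to put them on one scale. The sketch below normalizes CPU latency and energy to 1.0 and derives the remaining ratios. It rests on three assumptions that are not stated in the source: "1.7x slower" means 1.7 times the CPU latency, "6.5x more energy" means 6.5 times the NPU energy, and average system power is energy divided by time. Because the published ratios are rounded, the derived values are approximate.

```python
# Ratios copied from the abstract (query workload, then indexing workload).
NPU_LATENCY_GAIN_VS_CPU = 4.0      # NPU end-to-end latency is 4.0x lower
NPU_ENERGY_GAIN_VS_CPU = 4.0       # NPU system energy is 4.0x lower
GPU_SLOWDOWN_VS_CPU = 1.7          # assumed: GPU latency = 1.7 x CPU latency
GPU_ENERGY_VS_NPU = 6.5            # assumed: GPU energy = 6.5 x NPU energy
NPU_INDEX_THROUGHPUT_GAIN = 9.1    # embedding throughput vs. CPU
NPU_INDEX_ENERGY_GAIN = 12.3       # indexing system energy vs. CPU

# Query workload, with the CPU normalized to 1.0 on both axes.
latency = {
    "cpu": 1.0,
    "npu": 1.0 / NPU_LATENCY_GAIN_VS_CPU,
    "gpu": GPU_SLOWDOWN_VS_CPU,
}
energy = {"cpu": 1.0, "npu": 1.0 / NPU_ENERGY_GAIN_VS_CPU}
energy["gpu"] = energy["npu"] * GPU_ENERGY_VS_NPU

def relative_power(backend: str) -> float:
    """Average system power relative to the CPU, assuming P = E / t."""
    return energy[backend] / latency[backend]

# Derived cross-comparisons that the abstract does not state directly.
gpu_vs_npu_latency = latency["gpu"] / latency["npu"]
gpu_vs_cpu_energy = energy["gpu"] / energy["cpu"]

# Indexing: for a fixed corpus, time scales as 1 / throughput (assumption),
# so relative power = (1 / energy gain) / (1 / throughput gain).
npu_index_power = NPU_INDEX_THROUGHPUT_GAIN / NPU_INDEX_ENERGY_GAIN

if __name__ == "__main__":
    print(f"GPU latency vs. NPU:  {gpu_vs_npu_latency:.2f}x")
    print(f"GPU energy vs. CPU:   {gpu_vs_cpu_energy:.2f}x")
    for name in ("cpu", "npu", "gpu"):
        print(f"{name} relative query power: {relative_power(name):.2f}")
    print(f"NPU relative indexing power: {npu_index_power:.2f}")
```

Under these assumptions, the GPU comes out roughly 6.8x slower than the NPU and uses about 1.6x the CPU's energy per query. Equal 4.0x gains in latency and energy also imply that the NPU's query-time energy saving comes mostly from finishing sooner rather than from drawing less average system power, whereas on indexing the derived power ratio of about 0.74 suggests a lower average draw as well.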