Energy-Efficient On-Device RAG on a Mobile NPU: System Design and Benchmark on Snapdragon X Elite

arXiv cs.CL·Zhiyuan Cheng, Longying Lai

6/11/2026

Quick Answer

This paper shows that the Snapdragon X Elite's Hexagon NPU enables an energy-efficient, end-to-end Retrieval-Augmented Generation (RAG) pipeline, achieving 9.1x higher embedding throughput and using 12.3x less energy than the CPU.

Quick Take

In benchmarks, the NPU pipeline delivers 18.1x faster prefilling and 4.0x lower latency while maintaining answer quality on par with CPU and GPU. The result points toward sustainable edge intelligence on mobile NPUs.

Key Points

NPU achieves 9.1x higher embedding throughput than CPU.
Delivers 18.1x faster LLM prefilling on a 120-query benchmark.
Uses 4.0x less system energy compared to CPU for end-to-end queries.
Maintains answer quality comparable to CPU and GPU in evaluations.
Sets a precedent for energy-efficient mobile AI solutions.
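
The energy figures above are ratios between a CPU baseline and the NPU. The summary does not say how energy was measured, so the following is only a minimal sketch of one common approach, assuming evenly spaced system power samples integrated over the duration of a workload. The function names and the sample values are hypothetical and are not taken from the paper.

```python
def energy_joules(power_watts, dt_seconds):
    """Integrate evenly spaced power samples (watts) with the trapezoidal rule."""
    if len(power_watts) < 2:
        return 0.0
    pairs = zip(power_watts, power_watts[1:])
    return sum((a + b) / 2 * dt_seconds for a, b in pairs)


def reduction_factor(baseline, accelerated):
    """How many times smaller the accelerated value is than the baseline."""
    return baseline / accelerated


# Made-up traces for illustration only (NOT measurements from the paper):
# the baseline draws more power and also runs for longer.
cpu_energy = energy_joules([20.0, 20.0, 20.0], dt_seconds=1.0)
npu_energy = energy_joules([5.0, 5.0], dt_seconds=1.0)
print(f"{reduction_factor(cpu_energy, npu_energy):.1f}x less energy")
```

Because energy is power integrated over time, a faster accelerator can use less energy per query even when its instantaneous power draw is similar to the baseline.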

Source Excerpt

arXiv:2606.11257v1. Abstract: Retrieval-Augmented Generation (RAG) pipelines are compute-intensive, combining embedding, retrieval, reranking, and large language model (LLM) generation. Running them entirely on-device benefits privacy, latency, and offline use, but the energy cost of CPU inference is a major barrier.

We present what is, to our knowledge, the first end-to-end RAG pipeline that runs all neural stages -- embedding, reranking, and LLM generation -- on the Qualcomm Hexagon NPU of the Snapdragon X Elite. Profiling on a Dell XPS 13 laptop, we compare NPU-accelerated RAG against CPU and OpenCL/Adreno GPU baselines on indexing and query workloads. …
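
The four stages named in the abstract (embedding, retrieval, reranking, generation) can be sketched as a small pipeline. This is an illustrative toy under loud assumptions, not the authors' system: the embedder is a bag-of-words vector, the reranker is token overlap, and `generate` only assembles the prompt an LLM would receive. In the paper's design the neural stages run on the Hexagon NPU; nothing here touches an NPU, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Toy embedding: L2-normalised token counts (stand-in for a neural embedder)."""
    counts = Counter(tokens(text))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def cosine(a, b):
    # Both vectors are already normalised, so the dot product is the cosine.
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def retrieve(query_vec, index, k):
    """Return the k documents whose vectors are closest to the query vector."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def rerank(query, docs, k):
    """Toy reranker: Jaccard token overlap (stand-in for a neural reranker)."""
    q = set(tokens(query))

    def score(doc):
        d = set(tokens(doc))
        return len(q & d) / len(q | d) if q | d else 0.0

    return sorted(docs, key=score, reverse=True)[:k]


def generate(query, context):
    """Placeholder for LLM generation: builds the prompt a model would be given."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{joined}\nQuestion: {query}"


def answer(query, corpus, k_retrieve=3, k_rerank=1):
    index = [(doc, embed(doc)) for doc in corpus]            # indexing workload
    candidates = retrieve(embed(query), index, k_retrieve)   # query workload
    return generate(query, rerank(query, candidates, k_rerank))


corpus = [
    "The NPU runs embedding models.",
    "Bananas are yellow.",
    "Laptops have batteries.",
]
print(answer("Which chip runs embedding models?", corpus))
```

The split between `index` construction and the per-query path mirrors the indexing and query workloads that the excerpt says were benchmarked separately.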

More from arXiv cs.CL

arXiv cs.CL·Isabel Xu (The Overlake School), Cynthia Xu (The Overlake School), Rachel Ren (Edwards Vacuum Inc.), Cong Guo (The University of Memphis), Jiacheng Ding (The University of Memphis)

TriAgent: Divergence-Aware Committees for Cost-Efficient Financial Sentiment Analysis

AI Summary

TriAgent introduces a cost-efficient multi-agent system for financial sentiment analysis, combining VADER, FinBERT, and Qwen2.5. It achieves an F1 score of ~0.87 with significant savings of $9.3M/year at a 10M-user scale compared to GPT-4o-mini, while also detecting hallucinations with an AUC of 0.90.

RF-Agent: A Practical Framework for Building Language Agents for RFIC Design

Letting the Data Speak: Extracting Keywords from Crowdsourced Collections with AI

Related in this space

Synthetic Data Generation for Financial AI Research with NVIDIA NeMo

Deploy a Production-Ready NVIDIA AI-Q Blueprint on Oracle Cloud Infrastructure

Deploy Long-Context Reasoning and Agentic Workflows with MiniMax M3 on NVIDIA Accelerated Infrastructure
