BaseRT: Best-in-Class LLM Inference on Apple Silicon via Native Metal
Quick Answer
BaseRT is a native Metal inference runtime for large language models on Apple Silicon, achieving up to 1.56x higher decode throughput than llama.cpp and 1.35x higher than MLX.
Quick Take
Built natively on Metal, BaseRT supports a wide range of model families across eight quantization formats (Q2 to FP16) on Apple M-series devices. The authors argue that the results show Apple Silicon to be a more capable inference platform than previously reported, which matters as privacy requirements, latency constraints, and cloud costs push inference toward on-device deployment.
Key Points
- BaseRT reports the highest inference throughput on Apple Silicon to date.
- Supports a wide range of model families across eight quantization formats, from Q2 to FP16.
- Evaluated on the Qwen3, Llama 3.2, and Gemma 4 families at Q4 and Q8 quantization on M3 and M4 Pro devices.
- Delivers consistent best-in-class throughput for models ranging from sub-1B to 30B parameters.
- Publicly available on GitHub, enabling optimized local inference.
Article Content
From source RSS / original summary
arXiv:2607.00501v1
Abstract: We present BaseRT, a native Metal inference runtime for large language models (LLMs) on Apple Silicon, and report the highest inference throughput on this hardware to date. Existing runtimes, including llama.cpp and MLX-based frameworks, incur overhead from abstractions not designed for Metal's execution model or Apple Silicon's unified memory topology.
By building natively on Metal with chip-specific kernel fusion, unified memory-aware optimisation, and custom dispatch logic, BaseRT recovers performance that framework-based approaches leave on the table. BaseRT supports a wide range of model families across eight quantisation formats (Q2 to FP16) on all Apple M-series devices. In this paper, we evaluate the Qwen3, Llama 3.2, and Gemma 4 families at Q4 and Q8 quantisation on M3 and M4 Pro devices. BaseRT achieves up to 1.56x higher decode throughput than llama.cpp and up to 1.35x higher than MLX, with substantially larger margins on prefill for mixture-of-experts models, delivering consistent best-in-class throughput from sub-1B to 30B parameter models.
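To make the "up to 1.56x" and "up to 1.35x" figures concrete, the short Python sketch below converts a speedup ratio into an implied throughput and a per-token latency reduction. Only the two speedup ratios come from the abstract, and they are reported upper bounds rather than typical values. The baseline of 40 tokens per second and all function names are hypothetical, chosen purely for illustration and not taken from the paper.

```python
# Reported upper-bound decode speedups from the abstract.
SPEEDUP_VS_LLAMA_CPP = 1.56
SPEEDUP_VS_MLX = 1.35

# Hypothetical baseline, NOT a measured figure from the paper.
HYPOTHETICAL_BASELINE_TOK_S = 40.0


def implied_throughput(baseline_tok_s: float, speedup: float) -> float:
    """Throughput implied by a 'speedup-times higher' claim over a baseline."""
    return baseline_tok_s * speedup


def per_token_latency_ms(tok_s: float) -> float:
    """Average time per generated token, in milliseconds."""
    return 1000.0 / tok_s


def latency_reduction(speedup: float) -> float:
    """Fractional cut in per-token latency; independent of the baseline."""
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    for name, s in [("llama.cpp", SPEEDUP_VS_LLAMA_CPP), ("MLX", SPEEDUP_VS_MLX)]:
        fast = implied_throughput(HYPOTHETICAL_BASELINE_TOK_S, s)
        print(
            f"vs {name}: {HYPOTHETICAL_BASELINE_TOK_S:.1f} -> {fast:.1f} tok/s, "
            f"{per_token_latency_ms(fast):.1f} ms/token, "
            f"{latency_reduction(s):.1%} less time per token"
        )
```

Under these assumptions, a 1.56x throughput gain corresponds to roughly 36% less time per generated token, and 1.35x to roughly 26%, regardless of which baseline value is plugged in.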
These results establish Apple Silicon as a more capable inference platform than previously reported, with direct implications for the emerging edge inference paradigm: as privacy requirements, latency constraints, and cloud cost pressures drive inference toward on-device deployment, performance-optimised local runtimes are a critical enabling layer for this transition. BaseRT is publicly available at https://github.com/basecompute/baseRT


