Rigel: Reverse-Engineering the Metal 4.1 Tensor Compute Path on the Apple M4 Max GPU
Quick Take
Rigel reveals that Apple's Metal 4.1 fp8 matmul2d is emulated rather than hardware-accelerated, achieving only 0.94x the throughput of fp16. The operation runs entirely on the GPU shader cores with no dedicated matrix datapath, so on the M4 Max fp8 reduces memory footprint without improving performance.
Key Points
- Metal 4.1's fp8 (E4M3) matmul2d sustains 0.94x the throughput of fp16 despite reading half the operand bytes.
- The operation executes entirely on the GPU shader cores, with no dedicated matrix datapath and no evidence of Apple Neural Engine routing.
- Accumulation occurs in at least fp32, a width the specification does not state.
- A hand-fused GEMM + bias + GELU kernel outperforms the decomposed path by 6.5-12.9% in the cache-resident regime.
- All findings are reproducible from MIT-licensed code and CSVs.
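To make the fused-versus-decomposed distinction in the key points concrete, here is a minimal CPU-side sketch in plain Python. It is not the paper's Metal kernel: the function names, the tiny matrices, and the choice of the erf form of GELU are all assumptions made for illustration. The decomposed version makes three passes over the output (matmul, then bias, then GELU), while the fused version applies bias and GELU once per element while the accumulator is still live.

```python
import math

def gelu(x):
    # Exact (erf-based) GELU; the source does not say which GELU variant is used.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def dot(a, b, i, j):
    # Inner product of row i of a with column j of b, accumulated left to right.
    acc = 0.0
    for p in range(len(b)):
        acc += a[i][p] * b[p][j]
    return acc

def decomposed(a, b, bias):
    # Three separate passes; each pass materializes a full intermediate matrix.
    m, n = len(a), len(b[0])
    c = [[dot(a, b, i, j) for j in range(n)] for i in range(m)]
    c = [[c[i][j] + bias[j] for j in range(n)] for i in range(m)]
    return [[gelu(v) for v in row] for row in c]

def fused(a, b, bias):
    # One pass: bias and GELU are applied in the epilogue of each output element.
    m, n = len(a), len(b[0])
    return [[gelu(dot(a, b, i, j) + bias[j]) for j in range(n)] for i in range(m)]

# Hypothetical inputs, chosen only to exercise both paths.
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.5, -1.0], [0.25, 2.0]]
BIAS = [0.1, -0.2]

out_decomposed = decomposed(A, B, BIAS)
out_fused = fused(A, B, BIAS)
```

Both paths compute the same values; on a GPU the benefit of fusing would come from avoiding the intermediate writes and reads, which this sketch does not measure.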
Article Content
From source RSS / original summary: arXiv:2606.12765v1, announce type: new.
Abstract: Apple's Metal 4.1 exposes a tensor compute path: the Metal Performance Primitives (MPP) matmul2d operation over cooperative_tensor fragments, whose interface is documented but whose hardware behavior is deliberately hidden. The specification states which data-type rows are supported, never whether they are hardware-accelerated, where the operation physically executes, what its accumulator width is, or how it partitions matrix fragments across threads.
We present Rigel, an empirical characterization of this path on a single Apple M4 Max (a pre-neural-accelerator generation). Using a checksum-gated, provenance-tracked microbenchmark harness, Rigel recovers eleven facts the v4.1 specification hides or contradicts. The headline finding: the Metal 4.1 fp8 (E4M3) matmul2d is emulated, not accelerated: it sustains 0.94x the throughput of fp16 despite reading half the operand bytes, so on M4 it is a memory-footprint feature, not a performance feature.
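The arithmetic behind "half the operand bytes" can be sketched as follows. This is a back-of-the-envelope illustration, not the paper's harness: the matrix size is hypothetical, and the model assumes a purely bandwidth-bound kernel that reads each operand once and ignores output writes and cache reuse. Under that assumption, halving the element size would allow up to a 2x speedup, which puts the reported 0.94x in perspective.

```python
def operand_bytes(m, n, k, bytes_per_elem):
    """Bytes read for the A (m x k) and B (k x n) operands of one matmul."""
    return (m * k + k * n) * bytes_per_elem

def bandwidth_bound_speedup(m, n, k, bytes_wide=2, bytes_narrow=1):
    """Upper bound on speedup if operand traffic were the only cost."""
    return operand_bytes(m, n, k, bytes_wide) / operand_bytes(m, n, k, bytes_narrow)

# Hypothetical problem size; the ratio below does not depend on it.
M = N = K = 1024

ideal_speedup = bandwidth_bound_speedup(M, N, K)   # fp16 (2 bytes) vs fp8 (1 byte)
observed_speedup = 0.94                            # figure reported in the abstract
fraction_of_ideal = observed_speedup / ideal_speedup

print(f"ideal {ideal_speedup:.2f}x, observed {observed_speedup:.2f}x, "
      f"{fraction_of_ideal:.0%} of the bandwidth-bound ideal")
```

A result below 1x means the narrower type is slightly slower than fp16, which is consistent with the abstract's reading that fp8 on this GPU saves memory but not time.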
We further show, via a three-signal triangulation (throughput ceiling, comparison against simdgroup_matrix, and per-rail power attribution), that matmul2d executes entirely on the GPU shader cores with no dedicated matrix datapath and no evidence of Apple Neural Engine routing; that it accumulates in >=fp32; and we reconstruct the opaque 8x8 cooperative_tensor fragment layout Apple documents nowhere. Acting on the characterization, a hand-fused GEMM + bias + GELU kernel beats the decomposed path by +6.5-12.9% in the cache-resident regime. All findings are reproducible from committed MIT-licensed code and per-cell CSVs.
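The abstract does not describe how the accumulator width was established. One common way to probe it, shown here as a CPU-side simulation under that assumption, is to sum values whose total an fp16 accumulator cannot represent: once the running sum reaches 4096, the spacing between adjacent fp16 values is 4, so adding 1.0 rounds back to the same number. The helper names are hypothetical, and rounding is emulated with Python's struct module rather than run on a GPU.

```python
import struct

def round_fp16(x):
    # Round a Python float to the nearest IEEE 754 binary16 value.
    return struct.unpack("e", struct.pack("e", x))[0]

def round_fp32(x):
    # Round a Python float to the nearest IEEE 754 binary32 value.
    return struct.unpack("f", struct.pack("f", x))[0]

def accumulate(values, round_fn):
    # Sequential sum that rounds the accumulator after every addition,
    # imitating an accumulator of the given width.
    acc = 0.0
    for v in values:
        acc = round_fn(acc + v)
    return acc

# One large term followed by many small ones: exact sum is 5120.
probe = [4096.0] + [1.0] * 1024

fp16_sum = accumulate(probe, round_fp16)   # stalls at 4096.0
fp32_sum = accumulate(probe, round_fp32)   # reaches 5120.0

print(fp16_sum, fp32_sum)
```

If a real kernel fed the same inputs returned 5120, that would rule out an fp16 accumulator, though it would not by itself distinguish fp32 from anything wider.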