Rigel: Reverse-Engineering the Metal 4.1 Tensor Compute Path on… | AI Deep Signal

Rigel: Reverse-Engineering the Metal 4.1 Tensor Compute Path on the Apple M4 Max GPU

arXiv cs.CL·Ramchand Kumaresan

6/12/2026

·~2 min·6/12/2026·en·7

Quick Answer

Rigel reveals that Apple's Metal 4.1 fp8 matmul2d is emulated rather than hardware-accelerated, achieving only 0.94x the throughput of fp16.

Quick Take

The findings indicate that this operation runs entirely on GPU shader cores without a dedicated matrix datapath, impacting performance expectations for developers using the M4 Max GPU.

Key Points

Metal 4.1's fp8 matmul2d sustains 0.94x throughput of fp16 despite lower operand bytes.
The operation executes entirely on GPU shader cores without dedicated matrix datapath.
Accumulation occurs in at least fp32, contradicting expectations from the specification.
A hand-fused GEMM kernel outperforms the decomposed path by 6.5-12.9% in cache-resident scenarios.
All findings are reproducible from MIT-licensed code and CSVs.

Paper Resources

Read Paperarxiv.org View PDFarxiv.org

Source Excerpt

Apple's Metal 4. 1 exposes a tensor compute path: the Metal Performance Primitives (MPP) matmul2d operation over cooperative_tensor fragments, whose interface is documented but whose hardware behavior is deliberately hidden. The specification states which data-type rows are supported, never whether they are hardware-accelerated, where the operation physically executes, what its accumulator width is, or how it partitions matrix fragments across threads. We present Rigel, an empirical characterizat

Read the full article on arxiv.org

Want this in your inbox every morning?

Daily brief at your local 8am — bilingual EN/中文, free.

Subscribe — it's free

More from arXiv cs.CL

See more →

arXiv cs.CL·Isabel Xu (The Overlake School), Cynthia Xu (The Overlake School), Rachel Ren (Edwards Vacuum Inc.), Cong Guo (The University of Memphis), Jiacheng Ding (The University of Memphis)

5d ago

FeaturedOriginal

TriAgent: Divergence-Aware Committees for Cost-Efficient Financial Sentiment Analysis

AI Summary

TriAgent introduces a cost-efficient multi-agent system for financial sentiment analysis, combining VADER, FinBERT, and Qwen2.5. It achieves an F1 score of ~0.87 with significant savings of $9.3M/year at a 10M-user scale compared to GPT-4o-mini, while also detecting hallucinations with an AUC of 0.90.

#LLM #Agent #AI Startup #Enterprise AI

Rigel: Reverse-Engineering the Metal 4.1 Tensor Compute Path on the Apple M4 Max GPU

Quick Answer

Quick Take

Key Points

Paper Resources

Source Excerpt

Want this in your inbox every morning?

More from arXiv cs.CL

TriAgent: Divergence-Aware Committees for Cost-Efficient Financial Sentiment Analysis

RF-Agent: A Practical Framework for Building Language Agents for RFIC Design

Letting the Data Speak: Extracting Keywords from Crowdsourced Collections with AI

Related in this space

Synthetic Data Generation for Financial AI Research with NVIDIA NeMo

Deploy a Production-Ready NVIDIA AI-Q Blueprint on Oracle Cloud Infrastructure

Deploy Long-Context Reasoning and Agentic Workflows with MiniMax M3 on NVIDIA Accelerated Infrastructure

Quick Answer

Quick Take

Key Points

Paper Resources

Source Excerpt

Want this in your inbox every morning?

More from arXiv cs.CL

TriAgent: Divergence-Aware Multi-Agent Committees for Cost-Efficient Financial Sentiment Analysis

RF-Agent: A Practical Framework for Building Language Agents for RFIC Design

Letting the Data Speak: Extracting Keywords from Crowdsourced Collections with AI

Related in this space

Synthetic Data Generation for Financial AI Research with NVIDIA NeMo

Deploy a Production-Ready NVIDIA AI-Q Blueprint on Oracle Cloud Infrastructure

Deploy Long-Context Reasoning and Agentic Workflows with MiniMax M3 on NVIDIA Accelerated Infrastructure

TriAgent: Divergence-Aware Committees for Cost-Efficient Financial Sentiment Analysis