When Does Sparse MoE Help in Vision? The Role of Backbone Compute Leverage in Sparse Routing

arXiv cs.CV·Libo Sun, Po-wei Harn, Peixiong He, Xiao Qin



Quick Take

Sparse MoE helps vision classification only when a substantial fraction of total FLOPs is routed; at ImageNet scale, multi-expert routing (k ≥ 2) is also required.

Key Points

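The study evaluates sparse top-k routing with hard capacity constraints on four vision benchmarks (CIFAR-10/100, Tiny-ImageNet, ImageNet-1K).
Positive accuracy gaps require a substantial fraction ρ of total FLOPs to be routed.
A per-sample Soft MoE variant, which softmaxes over experts rather than the batch, lifts CIFAR-100 above the dense baseline.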
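
To make the routed-compute fraction concrete, here is a minimal sketch that assumes ρ is simply the FLOPs spent in routed (expert) layers divided by the total forward-pass FLOPs. The function name `routed_fraction`, the (flops, is_routed) layout, and the toy numbers are hypothetical illustrations, not values or code from the paper.

```python
def routed_fraction(layers):
    """Return rho = routed FLOPs / total FLOPs.

    `layers` is a list of (flops, is_routed) pairs. This layout is an
    assumption made for illustration only.
    """
    total = sum(flops for flops, _ in layers)
    if total <= 0:
        raise ValueError("total FLOPs must be positive")
    routed = sum(flops for flops, is_routed in layers if is_routed)
    return routed / total


# Hypothetical toy network: a dense stem, one routed expert block, a dense head.
toy_layers = [(2.0e8, False), (6.0e8, True), (2.0e8, False)]
rho = routed_fraction(toy_layers)
print(f"rho = {rho:.2f}")
```

Under this reading, widening the routed block raises ρ while the dense stem and head stay fixed, which may be how a hidden-size sweep moves ρ in practice.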


[Submitted on 15 May 2026]


Abstract: Mixture-of-Experts (MoE) networks promise favorable accuracy-compute trade-offs, yet practical vision deployments are hindered by expert collapse and limited end-to-end efficiency gains. We study when sparse top-$k$ routing with hard capacity constraints helps in vision classification, evaluated under multi-seed protocols on four benchmarks (CIFAR-10/100, Tiny-ImageNet, ImageNet-1K). We observe a compute-leverage pattern: positive accuracy gaps require a substantial fraction $\rho$ of total FLOPs to be routed; at ImageNet scale this is necessary but not sufficient, as multi-expert routing ($k \geq 2$) is additionally required. Two controlled experiments isolate these factors. A hidden-size sweep on CIFAR-10 yields both predicted sign reversals across standard and depthwise backbones, ruling out backbone family as the active variable. An ImageNet-1K ablation that varies only top-$k$, holding architecture, initialization, and $\rho$ fixed, reverses the gap from positive to negative across all five seeds. A per-sample variant of Soft MoE that softmaxes over experts rather than the batch rescues CIFAR-100 above the dense baseline, identifying batch-axis dispatch as the dominant failure mode in per-sample CNN settings. Code and aggregate results: this https URL.
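
The distinction between softmaxing over the batch and softmaxing over experts can be illustrated with a small sketch. It assumes one routing-logit row per sample (the per-sample CNN setting) and uses made-up logits. It only contrasts the two normalization axes and is not the authors' implementation.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical router logits: rows are samples in a batch, columns are experts.
logits = [
    [2.0, 0.5, -1.0],
    [0.1, 0.2, 0.3],
    [-0.5, 1.5, 0.0],
    [1.0, 1.0, 1.0],
]
num_samples, num_experts = len(logits), len(logits[0])

# Expert-axis softmax: normalize each sample's row independently.
expert_axis = [softmax(row) for row in logits]

# Batch-axis softmax: normalize each expert's column across the batch.
columns = [softmax([row[j] for row in logits]) for j in range(num_experts)]
batch_axis = [[columns[j][i] for j in range(num_experts)] for i in range(num_samples)]

# Total weight each sample receives under the two normalizations.
mass_expert_axis = [sum(row) for row in expert_axis]
mass_batch_axis = [sum(row) for row in batch_axis]

print("expert-axis mass per sample:", mass_expert_axis)
print("batch-axis mass per sample:", mass_batch_axis)
```

In this toy setup, expert-axis normalization gives every sample unit total weight regardless of the rest of the batch, whereas batch-axis normalization makes each sample's total weight depend on which other samples it is batched with.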

Comments:	24 pages (main + appendix), 8 figures, 18 tables. Under review at TMLR. Code and aggregate results: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2605.15484 [cs.CV]
	(or arXiv:2605.15484v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2605.15484 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Libo Sun
[v1] Fri, 15 May 2026 00:01:11 UTC (4,516 KB)

Originally published at arxiv.org


