When Does Sparse MoE Help in Vision? The Role of Backbone Compute Leverage in Sparse Routing
Quick Take
Sparse top-$k$ MoE routing improves vision classification accuracy only when a substantial fraction $\rho$ of total FLOPs is routed; at ImageNet scale, multi-expert routing ($k \geq 2$) is additionally required.
Key Points
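- The study evaluates sparse top-$k$ routing with hard capacity constraints on four vision benchmarks: CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet-1K.
- A positive accuracy gap requires a substantial fraction $\rho$ of total FLOPs to be routed; at ImageNet scale this is necessary but not sufficient.
- A per-sample Soft MoE variant that softmaxes over experts rather than the batch lifts CIFAR-100 above the dense baseline.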
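To make the routed-compute fraction concrete, here is a minimal sketch of one plausible way to account for it. The function name `routed_fraction`, its parameters, and the accounting itself (routed FLOPs are the `k` active experts, everything else is dense backbone) are assumptions for illustration; the paper's exact definition of $\rho$ may differ.

```python
def routed_fraction(backbone_flops: float, expert_flops: float, k: int) -> float:
    """Share of per-sample FLOPs spent inside the k active experts.

    Hypothetical accounting: total = dense backbone + k * per-expert cost.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    routed = k * expert_flops
    return routed / (backbone_flops + routed)


# Example: a 3.0 GFLOP dense backbone with two active experts of 0.5 GFLOP each.
rho = routed_fraction(backbone_flops=3.0, expert_flops=0.5, k=2)
```

Under this accounting, changing `k` while keeping $\rho$ fixed would require rescaling the per-expert cost, since the routed term grows linearly in `k`.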
Abstract: Mixture-of-Experts (MoE) networks promise favorable accuracy-compute trade-offs, yet practical vision deployments are hindered by expert collapse and limited end-to-end efficiency gains. We study when sparse top-$k$ routing with hard capacity constraints helps in vision classification, evaluated under multi-seed protocols on four benchmarks (CIFAR-10/100, Tiny-ImageNet, ImageNet-1K). We observe a compute-leverage pattern: positive accuracy gaps require a substantial fraction $\rho$ of total FLOPs to be routed; at ImageNet scale this is necessary but not sufficient, as multi-expert routing ($k \geq 2$) is additionally required. Two controlled experiments isolate these factors. A hidden-size sweep on CIFAR-10 yields both predicted sign reversals across standard and depthwise backbones, ruling out backbone family as the active variable. An ImageNet-1K ablation that varies only top-$k$, holding architecture, initialization, and $\rho$ fixed, reverses the gap from positive to negative across all five seeds. A per-sample variant of Soft MoE that softmaxes over experts rather than the batch rescues CIFAR-100 above the dense baseline, identifying batch-axis dispatch as the dominant failure mode in per-sample CNN settings. Code and aggregate results: this https URL.
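The distinction between softmaxing over experts and softmaxing over the batch can be illustrated with a small sketch. It assumes a routing-logit matrix `logits[b][e]` with one score per sample and expert; the helper names are hypothetical and this is not the authors' implementation, only the normalization-axis difference the abstract describes.

```python
import math


def softmax(values):
    """Numerically stable softmax over a flat list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_over_experts(logits):
    """Normalize each sample's row: every sample's expert weights sum to 1."""
    return [softmax(row) for row in logits]


def softmax_over_batch(logits):
    """Normalize each expert's column: weights sum to 1 across the batch,
    so one sample's weights depend on which other samples share its batch."""
    columns = [softmax(list(col)) for col in zip(*logits)]
    return [list(row) for row in zip(*columns)]


# Toy logits with shape (batch=3, experts=3).
logits = [[2.0, 0.0, -1.0],
          [0.5, 0.5, 0.5],
          [-3.0, 1.0, 0.0]]

per_expert_axis = softmax_over_experts(logits)
per_batch_axis = softmax_over_batch(logits)

row_sums = [sum(row) for row in per_expert_axis]        # each is 1 (up to rounding)
col_sums = [sum(col) for col in zip(*per_batch_axis)]   # each is 1 (up to rounding)
```

With batch-axis normalization, the weights a single sample receives need not sum to 1, and they change if the other samples in the batch change, which is one way to read the batch-axis dispatch failure mode named in the abstract.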
| Comments: | 24 pages (main + appendix), 8 figures, 18 tables. Under review at TMLR. Code and aggregate results: this https URL |
| Subjects: | Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) |
| Cite as: | arXiv:2605.15484 [cs.CV] (or arXiv:2605.15484v1 [cs.CV] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.15484 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Libo Sun
[v1] Fri, 15 May 2026 00:01:11 UTC (4,516 KB)
— Originally published at arxiv.org