Mixture-of-Experts (MoE) models have shown strong potential for parameter-efficient scaling across domains. However, their application to image classification remains limited, often requiring billion-scale datasets to be competitive. In this work, we explore the integration of MoE layers into image classification architectures using open datasets. We conduct a systematic analysis across different MoE configurations and model scales. We find that activating a moderate number of parameters per sample provides the best trade-off between performance and efficiency; as the number of activated parameters grows further, the benefits of MoE diminish. Our analysis yields several practical insights for vision MoE design. First, MoE layers most effectively strengthen tiny and mid-sized models, while gains taper off for large-capacity networks and do not redefine state-of-the-art ImageNet performance. Second, a Last-2 placement heuristic offers the most robust cross-architecture choice, with Every-2 slightly better for the Vision Transformer (ViT), and both remain effective as data and model scale increase. Third, larger datasets (e.g., ImageNet-21k) allow more experts (up to 16 for ConvNeXt) to be utilized effectively without changing placement, as increased data reduces overfitting and promotes broader expert specialization. Finally, a simple linear router performs best, suggesting that additional routing complexity yields no consistent benefit.
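To make the design recommendations concrete, the sketch below shows a minimal top-k MoE feed-forward block driven by a single linear routing layer, the simple router the abstract favors. This is an illustrative assumption rather than the authors' implementation: the class names (SimpleMoE, Expert), the choice of top_k = 2, and the token shapes are hypothetical.

```python
# Minimal sketch (assumption, not the paper's code) of a top-k MoE feed-forward
# block with a linear router. Class names and hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A standard MLP expert (the block an MoE layer replaces)."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))


class SimpleMoE(nn.Module):
    """Top-k MoE layer routed by a single linear layer (softmax over top-k)."""
    def __init__(self, dim, hidden_dim, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)  # the "simple linear router"
        self.experts = nn.ModuleList(
            Expert(dim, hidden_dim) for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):
        # x: (num_tokens, dim); flatten batch/sequence dims before calling.
        logits = self.router(x)                          # (tokens, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)   # keep top-k experts per token
        weights = F.softmax(weights, dim=-1)             # normalize over the selected k
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e                              # tokens that routed to expert e
            token_ids, slot = mask.nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out


if __name__ == "__main__":
    moe = SimpleMoE(dim=384, hidden_dim=1536, num_experts=8, top_k=2)
    tokens = torch.randn(196, 384)   # e.g. one image's worth of ViT patch tokens
    print(moe(tokens).shape)         # torch.Size([196, 384])
```

Under the Last-2 heuristic discussed above, a block like this would replace the MLP in only the final two blocks of the backbone; under Every-2, it would replace every second MLP throughout the network.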