Mixture-of-Experts Models in Vision: Routing, Optimization, and Generalization

Mixture-of-Experts (MoE) architectures enable conditional computation by routing inputs to multiple expert subnetworks and are often motivated as a mechanism for scaling large language models. In this project, we instead study MoE behavior in an image classification setting, focusing on predictive performance, expert utilization, and generalization. We compare dense, SoftMoE, and SparseMoE classifier heads on the CIFAR-10 dataset under comparable model capacity. Both MoE variants achieve slightly higher validation accuracy than the dense baseline, and regularization keeps their expert utilization balanced, avoiding expert collapse. To analyze generalization, we compute Hessian-based sharpness metrics at convergence, namely the largest eigenvalue and the trace of the loss Hessian, evaluated on both training and test data. We find that SoftMoE exhibits higher sharpness by these metrics, while the dense baseline and SparseMoE lie in a similar curvature regime, even though all models achieve comparable generalization performance. Complementary loss surface perturbation analyses reveal qualitative differences in non-local behavior under finite parameter perturbations between dense and MoE models; these differences help contextualize the curvature-based measurements without directly explaining validation accuracy. We further evaluate empirical inference efficiency and show that naively implemented conditional routing does not yield inference speedups on modern hardware at this scale, highlighting the gap between theoretical and realized efficiency in sparse MoE models.
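
To make the routing and utilization terms concrete, the following is a minimal sketch in pure Python, not the implementation used in this project. It assumes that the soft variant weights all experts with a softmax over gate logits and that the sparse variant keeps only the top-k logits; the abstract does not specify either detail. The function names (`soft_gate`, `sparse_gate`, `expert_load`, `balance_penalty`) and the squared-coefficient-of-variation penalty are hypothetical choices for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_gate(logits):
    """Soft routing (assumed form): every expert receives a nonzero weight."""
    return softmax(logits)

def sparse_gate(logits, k):
    """Top-k routing (assumed form): keep the k largest logits,
    renormalize over them, and give all other experts weight zero."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    kept = softmax([logits[i] for i in top])
    weights = [0.0] * len(logits)
    for i, w in zip(top, kept):
        weights[i] = w
    return weights

def expert_load(batch_weights):
    """Mean gate weight per expert over a batch of inputs."""
    n = len(batch_weights)
    num_experts = len(batch_weights[0])
    return [sum(w[e] for w in batch_weights) / n for e in range(num_experts)]

def balance_penalty(load):
    """Squared coefficient of variation of the load (hypothetical regularizer).
    It is 0 for perfectly uniform load and grows as experts collapse."""
    mean = sum(load) / len(load)
    var = sum((x - mean) ** 2 for x in load) / len(load)
    return var / (mean ** 2)

if __name__ == "__main__":
    batch_logits = [[2.0, 0.5, -1.0, 0.1], [0.0, 1.5, 0.3, -0.2]]
    soft = [soft_gate(row) for row in batch_logits]
    sparse = [sparse_gate(row, k=2) for row in batch_logits]
    print("soft load:  ", expert_load(soft))
    print("sparse load:", expert_load(sparse))
    print("sparse penalty:", balance_penalty(expert_load(sparse)))
```

Under these assumptions, a penalty of this kind added to the training loss pushes the per-expert load toward uniform, which is one common way to discourage expert collapse. Note that the sparse gate only saves compute if the experts with zero weight are actually skipped at inference time.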

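
The two sharpness metrics named in the abstract, the largest Hessian eigenvalue and the Hessian trace, are usually estimated from Hessian-vector products rather than from an explicit Hessian. The sketch below illustrates that idea on a toy two-parameter quadratic loss, using only the Python standard library. Everything in it is an assumption for illustration: the toy loss, the finite-difference Hessian-vector product (a real experiment would more likely use automatic differentiation), and the function names `hvp`, `top_eigenvalue`, and `hutchinson_trace`. It is not the procedure used in this project.

```python
import math
import random

def grad(theta):
    """Gradient of a toy loss L = 0.5 * theta^T A theta with A = [[3, 1], [1, 2]].
    The Hessian of this loss is A everywhere."""
    x, y = theta
    return [3.0 * x + 1.0 * y, 1.0 * x + 2.0 * y]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hvp(grad_fn, theta, v, eps=1e-3):
    """Hessian-vector product via central differences of the gradient."""
    plus = grad_fn([t + eps * vi for t, vi in zip(theta, v)])
    minus = grad_fn([t - eps * vi for t, vi in zip(theta, v)])
    return [(p - m) / (2.0 * eps) for p, m in zip(plus, minus)]

def top_eigenvalue(grad_fn, theta, iters=100, seed=0):
    """Power iteration on the Hessian. This converges to the eigenvalue of
    largest magnitude, which equals the largest eigenvalue when the Hessian
    is positive semidefinite (as it is for the toy loss here)."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in theta]
    for _ in range(iters):
        hv = hvp(grad_fn, theta, v)
        norm = math.sqrt(dot(hv, hv))
        v = [h / norm for h in hv]
    # Rayleigh quotient v^T H v for the unit vector v.
    return dot(v, hvp(grad_fn, theta, v))

def hutchinson_trace(grad_fn, theta, samples=2000, seed=0):
    """Hutchinson estimator: the average of z^T H z over random
    Rademacher vectors z is an unbiased estimate of trace(H)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        z = [rng.choice((-1.0, 1.0)) for _ in theta]
        total += dot(z, hvp(grad_fn, theta, z))
    return total / samples

if __name__ == "__main__":
    theta = [0.5, -0.3]
    print("largest eigenvalue:", top_eigenvalue(grad, theta))
    print("trace estimate:   ", hutchinson_trace(grad, theta))
```

For the toy matrix the exact values are (5 + sqrt(5)) / 2 for the largest eigenvalue and 5 for the trace, so the two estimates can be checked directly. With a neural network, `grad_fn` would be the gradient of the training or test loss with respect to all parameters, and the same two estimators apply unchanged.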