Mixture-of-Experts (MoE) models are a promising way to scale up model capacity without significantly increasing computational cost. A key component of an MoE is the router, which decides which subset of parameters (experts) processes which feature embeddings (tokens). In this paper, we present a comprehensive study of routers in MoEs for computer vision tasks. We introduce a unified MoE formulation that subsumes different MoE variants using two parametric routing tensors. This formulation covers both sparse MoE, which uses a binary (hard) assignment between experts and tokens, and soft MoE, which uses a soft assignment between experts and weighted combinations of tokens. Routers for sparse MoEs can be further grouped into two variants: Token Choice, which matches experts to each token, and Expert Choice, which matches tokens to each expert. We conduct head-to-head experiments with six different routers, including existing routers from prior work and new ones we introduce. We show that (i) many routers originally developed for language modeling can be adapted to perform strongly in vision tasks, (ii) in sparse MoE, Expert Choice routers generally outperform Token Choice routers, and (iii) soft MoEs generally outperform sparse MoEs under a fixed compute budget. These results provide new insights into the crucial role of routers in vision MoE models.
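The three assignment styles named in the abstract (Token Choice, Expert Choice, and soft assignment) can be illustrated with a small plain-Python sketch. This is only an illustration under stated assumptions, not the paper's implementation: the function names, the toy score matrix, and the use of a softmax over tokens for the soft case are all hypothetical choices made here, and the sketch does not reproduce the paper's two routing tensors.

```python
import math


def token_choice(scores, k):
    """Hard assignment: each token (row) selects its k highest-scoring experts."""
    assign = []
    for row in scores:
        top = sorted(range(len(row)), key=lambda e: row[e], reverse=True)[:k]
        assign.append([1 if e in top else 0 for e in range(len(row))])
    return assign


def expert_choice(scores, capacity):
    """Hard assignment: each expert (column) selects its `capacity` best tokens."""
    n_tokens, n_experts = len(scores), len(scores[0])
    assign = [[0] * n_experts for _ in range(n_tokens)]
    for e in range(n_experts):
        top = sorted(range(n_tokens), key=lambda t: scores[t][e], reverse=True)
        for t in top[:capacity]:
            assign[t][e] = 1
    return assign


def soft_dispatch(scores):
    """Soft assignment (assumed form): each expert gets a softmax-weighted mix of all tokens."""
    n_tokens, n_experts = len(scores), len(scores[0])
    weights = [[0.0] * n_experts for _ in range(n_tokens)]
    for e in range(n_experts):
        col = [scores[t][e] for t in range(n_tokens)]
        m = max(col)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in col]
        z = sum(exps)
        for t in range(n_tokens):
            weights[t][e] = exps[t] / z
    return weights


# Toy router scores: 4 tokens (rows) x 2 experts (columns). Values are made up.
scores = [
    [2.0, 0.5],
    [1.5, 0.1],
    [0.3, 0.2],
    [1.0, 0.9],
]

tc = token_choice(scores, k=1)          # every token picks one expert
ec = expert_choice(scores, capacity=2)  # every expert picks two tokens
sd = soft_dispatch(scores)              # every expert weights all tokens

print("token choice :", tc)
print("expert choice:", ec)
print("soft weights :", [[round(w, 3) for w in row] for row in sd])
```

On this toy input, Token Choice with `k=1` sends all four tokens to the first expert and none to the second, whereas Expert Choice with `capacity=2` gives each expert exactly two tokens but leaves the third token unprocessed. The soft variant drops no tokens, since each expert's weights over tokens are positive and sum to one. These outcomes are properties of the made-up scores, not results from the paper.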