We present ScatterMoE, an implementation of Sparse Mixture-of-Experts (SMoE) on GPUs. ScatterMoE builds upon existing implementations, overcoming some of their limitations to improve inference and training speed and reduce memory footprint. It achieves this by avoiding padding and excessive copying of the input. We introduce ParallelLinear, the main component we use to build our implementation, and the various kernels used to speed up the operation. We benchmark our implementation against Megablocks and show that it enables higher throughput and a lower memory footprint. We also show how ParallelLinear enables extension of the Mixture-of-Experts concept by demonstrating an implementation of Mixture of Attention.
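To make the padding-free routing idea concrete, here is a minimal NumPy sketch (not the actual Triton/ParallelLinear implementation; the function name and tensor shapes are illustrative assumptions): tokens are grouped by expert with a single gather, each expert's weight multiplies one contiguous slice, and results are scattered back to the original token order — no per-expert padded buffers are allocated.

```python
import numpy as np

def scatter_moe_linear(x, expert_idx, weights):
    """Hypothetical sketch: apply per-expert linear layers without padding.

    x:          (tokens, d_in)  input token representations
    expert_idx: (tokens,)       expert assignment per token (top-1 routing)
    weights:    (n_experts, d_in, d_out) one weight matrix per expert
    """
    order = np.argsort(expert_idx, kind="stable")   # group tokens by expert
    grouped = x[order]                               # one gather, no padded buffers
    counts = np.bincount(expert_idx, minlength=weights.shape[0])

    out = np.empty((x.shape[0], weights.shape[2]), dtype=x.dtype)
    start = 0
    for e, c in enumerate(counts):
        if c:                                        # contiguous slice per expert
            out[start:start + c] = grouped[start:start + c] @ weights[e]
        start += c

    result = np.empty_like(out)
    result[order] = out                              # scatter back to token order
    return result
```

The grouped layout lets each expert run a dense matrix multiply on a contiguous block sized exactly to its token count, which is the property the fused GPU kernels exploit.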