Large multimodal Mixture-of-Experts (MoE) models effectively scale model size to boost performance while keeping the number of active parameters fixed. However, previous works primarily used full-precision experts during sparse up-cycling. Although they achieve superior performance on end tasks, the large number of experts introduces a higher memory footprint, which poses significant challenges for deployment on edge devices. In this work, we propose MoTE, a scalable and memory-efficient approach for training Mixture-of-Ternary-Experts models from a dense checkpoint. Instead of training fewer high-precision experts, we propose to train more low-precision experts during up-cycling. Specifically, we use the pre-trained FFN as a shared expert and train ternary routed experts with parameters in {-1, 0, 1}. Extensive experiments show that our approach exhibits a promising scaling trend with model size. MoTE achieves performance comparable to the full-precision baseline MoE-LLaVA while requiring a lower memory footprint. Furthermore, our approach is compatible with post-training quantization methods, and its advantage is amplified further as the memory constraint tightens. Given the same expert memory footprint of 3.4 GB and combined with post-training quantization, MoTE outperforms MoE-LLaVA by 4.3% average accuracy on end tasks, demonstrating its effectiveness and potential for memory-constrained devices.
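For illustration, the following is a minimal PyTorch sketch of such a block: a full-precision shared FFN (as would be inherited from the dense checkpoint) alongside routed experts whose weights are quantized to {-1, 0, 1}. The absmean quantizer, straight-through estimator, top-1 routing, and all names and layer sizes (`ternarize`, `TernaryExpert`, `MoTEBlock`) are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def ternarize(w: torch.Tensor) -> torch.Tensor:
    # Absmean-style ternarization to {-1, 0, +1} times a per-tensor scale,
    # with a straight-through estimator so gradients reach the latent weights.
    # (The exact quantizer used by MoTE is an assumption here.)
    scale = w.abs().mean().clamp(min=1e-5)
    w_q = torch.clamp(torch.round(w / scale), -1.0, 1.0) * scale
    return w + (w_q - w).detach()


class TernaryExpert(nn.Module):
    """Routed FFN expert whose weights are ternarized on the fly during training."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.gelu(F.linear(x, ternarize(self.up.weight)))
        return F.linear(h, ternarize(self.down.weight))


class MoTEBlock(nn.Module):
    """Full-precision shared expert plus ternary routed experts.

    Top-1 routing and the hyperparameters below are illustrative choices.
    """

    def __init__(self, dim: int, hidden: int, num_experts: int = 4):
        super().__init__()
        # In up-cycling, this shared FFN would be initialized from the pre-trained dense FFN.
        self.shared = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList([TernaryExpert(dim, hidden) for _ in range(num_experts)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim)
        gate_w, gate_i = F.softmax(self.router(x), dim=-1).max(dim=-1)  # top-1 routing
        out = self.shared(x)
        for e, expert in enumerate(self.experts):
            idx = (gate_i == e).nonzero(as_tuple=True)[0]
            if idx.numel():
                out = out.index_add(0, idx, gate_w[idx, None] * expert(x[idx]))
        return out


if __name__ == "__main__":
    block = MoTEBlock(dim=64, hidden=256, num_experts=4)
    tokens = torch.randn(10, 64)
    print(block(tokens).shape)  # torch.Size([10, 64])
```

In this sketch, only the shared expert and the router remain in full precision, so the per-expert memory cost scales with ternary rather than 16-bit weights; how the ternary weights are packed for deployment and combined with post-training quantization is left out here.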