The field of natural language processing (NLP) has made significant strides in recent years, particularly in the development of large-scale vision-language models (VLMs). These models aim to bridge the gap between text and visual information, enabling a more comprehensive understanding of multimedia data. However, as these models grow larger and more complex, they also become harder to train and deploy. One approach to this challenge is the sparsely-gated mixture-of-experts (MoE) technique, which divides the model into smaller, specialized sub-models that jointly solve a task. In this paper, we explore the effectiveness of MoE in scaling vision-language models and demonstrate its potential to achieve state-of-the-art performance on a range of benchmarks, outperforming dense models of equivalent computational cost. Our research offers insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-off between compute and performance when scaling VLMs. We hope our work will inspire further research into the use of MoE for scaling vision-language models and other multimodal machine learning applications.
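To make the idea of sparse gating concrete, the following is a minimal sketch of top-k expert routing for a single token. It is an illustration under stated assumptions, not the implementation used in this work: the names `top_k_gate`, `moe_forward`, and `make_expert`, the toy scaling experts, and the hand-picked gate weights are all hypothetical, and practical MoE layers typically add details such as load balancing and capacity limits that are omitted here.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    return dict(zip(top, weights))


def make_expert(scale):
    """Hypothetical toy expert: scales its input vector by a constant."""
    return lambda vec: [scale * x for x in vec]


def moe_forward(token, gate_weights, experts, k=2):
    """Route one token through its top-k experts and mix their outputs."""
    # Router: one logit per expert (dot product of token and gate vector).
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    routing = top_k_gate(logits, k)

    # Only the selected experts run; the remaining experts are skipped.
    out = [0.0] * len(token)
    for idx, weight in routing.items():
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out, routing


# Assumed toy setup: four experts, two-dimensional tokens.
experts = [make_expert(s) for s in (0.5, 1.0, 2.0, 4.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
output, routing = moe_forward([3.0, 1.0], gate_weights, experts, k=2)
```

In this sketch only `k` of the four experts are evaluated for the token, which is the property that lets total parameter count grow without a proportional increase in per-token compute.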