Scaling Vision-Language Models with Sparse Mixture of Experts

The field of natural language processing (NLP) has made significant strides in recent years, particularly in the development of large-scale vision-language models (VLMs). These models aim to bridge the gap between text and visual information, enabling a more comprehensive understanding of multimedia data. However, as these models grow larger and more complex, they also become harder to train and deploy. One approach to this challenge is the sparsely-gated mixture-of-experts (MoE) technique, which divides the model into smaller, specialized sub-models that jointly solve a task. In this paper, we explore the effectiveness of MoE in scaling vision-language models, demonstrating its potential to achieve state-of-the-art performance on a range of benchmarks, outperforming dense models of equivalent computational cost. Our research offers insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-off between compute and performance when scaling VLMs. We hope our work will inspire further research into the use of MoE for scaling vision-language models and other multimodal machine learning applications.
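To make the idea of sparse gating concrete, the following is a minimal sketch of top-k expert routing for a single token. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the routing rule, the value of k, or the expert architecture, so the softmax top-k gate, the toy scaling "experts", and every name here (`top_k_route`, `moe_layer`, `make_expert`, `gate_weights`) are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in top)
    return [(i, probs[i] / kept) for i in top]


def moe_layer(token, gate_weights, experts, k=2):
    """Route one token to k experts and return the weighted sum of their outputs.

    Only the selected experts are evaluated, which is where the compute
    saving over a dense layer comes from.
    """
    # One gate logit per expert: dot product of the token with that expert's gate row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    routes = top_k_route(logits, k)
    out = [0.0] * len(token)
    for idx, weight in routes:
        expert_out = experts[idx](token)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out, routes


def make_expert(scale):
    """Toy stand-in for an expert network (assumption): scales the token."""
    return lambda token: [scale * x for x in token]


experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
output, routes = moe_layer([0.5, 2.0], gate_weights, experts, k=2)
print(routes)
print(output)
```

With four experts and k=2, only half of the expert functions run for this token, while the total parameter count still covers all four. That gap between parameters and per-token compute is the trade-off the abstract refers to.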

