We present Integrated Multimodal Perception (IMP), a simple and scalable multimodal multi-task training and modeling approach. IMP integrates multimodal inputs including image, video, text, and audio into a single Transformer encoder with minimal modality-specific components. IMP makes use of a novel design that combines Alternating Gradient Descent (AGD) and Mixture-of-Experts (MoE) for efficient model and task scaling. We conduct extensive empirical studies of IMP and reveal the following key insights: 1) performing gradient descent updates by alternating on heterogeneous modalities, loss functions, and tasks, while also varying input resolutions, efficiently improves multimodal understanding; 2) model sparsification with MoE on a single modality-agnostic encoder substantially improves performance, outperforming dense models that use modality-specific encoders or additional fusion layers and greatly mitigating the conflicts between modalities. IMP achieves competitive performance on a wide range of downstream tasks including image classification, video classification, image-text retrieval, and video-text retrieval. Most notably, we train a sparse IMP-MoE-L focusing on video tasks that achieves a new state of the art in zero-shot video classification. Our model achieves zero-shot classification accuracy of 77.0% on Kinetics-400, 76.8% on Kinetics-600, and 76.8% on Kinetics-700, improving on the previous state of the art by +5%, +6.7%, and +5.8%, respectively, while using only 15% of its total training computational cost.
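To make the alternating update concrete, the following is a minimal toy sketch of alternating gradient descent, not the IMP implementation. It assumes a shared linear model in place of the Transformer encoder, two hypothetical tasks with different loss functions (mean squared error and Huber), and different batch sizes standing in for varying input shapes. All names (`mse_grad`, `huber_grad`, `make_batch`, `alternating_gradient_descent`, `TASKS`) are illustrative assumptions. At each step exactly one task is selected, and only that task's gradient updates the shared parameters.

```python
import random


def mse_grad(w, batch):
    """Gradient of the mean squared error of a linear model on one batch."""
    grad = [0.0] * len(w)
    for x, y in batch:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i, xi in enumerate(x):
            grad[i] += 2.0 * err * xi / len(batch)
    return grad


def huber_grad(w, batch, delta=1.0):
    """Gradient of the mean Huber loss of the same linear model."""
    grad = [0.0] * len(w)
    for x, y in batch:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        clipped = max(-delta, min(delta, err))  # derivative of Huber w.r.t. err
        for i, xi in enumerate(x):
            grad[i] += clipped * xi / len(batch)
    return grad


def make_batch(rng, w_true, size):
    """Sample a noise-free synthetic batch (toy data, assumed for illustration)."""
    batch = []
    for _ in range(size):
        x = [rng.uniform(-1.0, 1.0) for _ in w_true]
        y = sum(wi * xi for wi, xi in zip(w_true, x))
        batch.append((x, y))
    return batch


def alternating_gradient_descent(tasks, w, steps, lr, rng, w_true):
    """Round-robin over tasks; each step applies one task's gradient only."""
    schedule = []
    for step in range(steps):
        name, grad_fn, batch_size = tasks[step % len(tasks)]
        batch = make_batch(rng, w_true, batch_size)
        grad = grad_fn(w, batch)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]  # update shared parameters
        schedule.append(name)
    return w, schedule


W_TRUE = [1.0, -2.0]
# (task name, gradient function, batch size); all hypothetical
TASKS = [("task_a_mse", mse_grad, 8), ("task_b_huber", huber_grad, 32)]

w_final, schedule = alternating_gradient_descent(
    TASKS, [0.0, 0.0], steps=600, lr=0.1, rng=random.Random(0), w_true=W_TRUE
)
print(w_final)
```

The round-robin schedule is only one possible choice made for this sketch; the abstract does not specify how tasks are sampled. Because each step touches a single task, the losses are never summed, so tasks with different objectives and input shapes can share one set of parameters without a joint batch.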