We present Integrated Multimodal Perception (IMP), a simple and scalable multimodal multi-task training and modeling approach. IMP integrates multimodal inputs including image, video, text, and audio into a single Transformer encoder with minimal modality-specific components. IMP makes use of a novel design that combines Alternating Gradient Descent (AGD) and Mixture-of-Experts (MoE) for efficient model and task scaling. We conduct extensive empirical studies and reveal the following key insights: 1) Performing gradient descent updates by alternating on diverse modalities, loss functions, and tasks, with varying input resolutions, efficiently improves the model. 2) Sparsification with MoE on a single modality-agnostic encoder substantially improves performance, outperforming dense models that use modality-specific encoders or additional fusion layers, and greatly mitigates the conflicts between modalities. IMP achieves competitive performance on a wide range of downstream tasks, including video classification, image classification, and image-text and video-text retrieval. Most notably, we train a sparse IMP-MoE-L variant focusing on video tasks that achieves a new state of the art in zero-shot video classification: 77.0% on Kinetics-400, 76.8% on Kinetics-600, and 68.3% on Kinetics-700, improving on the previous state of the art by +5%, +6.7%, and +5.8%, respectively, while using only 15% of the total training computational cost of those previous models.
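To make the alternating-update idea in insight 1 concrete, the following is a minimal toy sketch and not the IMP implementation. It assumes two hypothetical tasks (`image_text` and `video_text`) that share one parameter vector, uses simple quadratic losses in place of the real objectives, and alternates tasks in a fixed round-robin order. It does not model varying input resolutions or MoE routing; all function and task names are illustrative.

```python
from itertools import cycle


def make_quadratic_task(target):
    """Return (loss, grad) for the toy objective 0.5 * ||w - target||^2."""
    def loss(w):
        return 0.5 * sum((wi - ti) ** 2 for wi, ti in zip(w, target))

    def grad(w):
        return [wi - ti for wi, ti in zip(w, target)]

    return loss, grad


def alternating_gradient_descent(w, tasks, num_steps, lr=0.1):
    """Update the shared parameters w using ONE task's gradient per step.

    Assumption: tasks are visited round-robin. The abstract only states that
    updates alternate across modalities, losses, and tasks.
    """
    schedule = cycle(tasks.items())  # hypothetical round-robin task schedule
    history = []
    for _ in range(num_steps):
        name, (loss_fn, grad_fn) = next(schedule)
        g = grad_fn(w)  # gradient of the current task only
        w = [wi - lr * gi for wi, gi in zip(w, g)]
        history.append((name, loss_fn(w)))
    return w, history


# Two toy tasks that pull the shared parameters toward different targets.
tasks = {
    "image_text": make_quadratic_task([1.0, 0.0]),
    "video_text": make_quadratic_task([0.0, 1.0]),
}
w_final, history = alternating_gradient_descent([0.0, 0.0], tasks, num_steps=200)
print(w_final)
```

In this toy setting the shared parameters settle near a compromise between the two targets, even though no step ever sums the two losses. Each step touches only one task's loss, which is what lets per-task inputs and objectives differ from step to step.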