Multimodal learning, which integrates information from modalities such as text, images, audio, and video, is pivotal for complex tasks like visual question answering, cross-modal retrieval, and caption generation. Traditional approaches rely on modality-specific encoders and late-fusion techniques, which can hinder scalability and flexibility when adapting to new tasks or modalities. To address these limitations, we introduce a novel framework that extends the concept of task reformulation beyond natural language processing (NLP) to multimodal learning. We propose reformulating diverse multimodal tasks as a unified next-frame prediction problem, allowing a single model to handle different modalities without modality-specific components. This method treats all inputs and outputs as sequential frames in a video, enabling seamless integration of modalities and effective knowledge transfer across tasks. We evaluate our approach on a range of tasks, including text-to-text, image-to-text, video-to-video, video-to-text, and audio-to-text, demonstrating the model's ability to generalize across modalities with minimal adaptation. We show that task reformulation can significantly simplify multimodal model design across various tasks, laying the groundwork for more generalized multimodal foundation models.
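To make the reformulation concrete, the sketch below illustrates one way the idea could look in code: every modality is first rendered as a sequence of fixed-size frames, and a single causal model is trained to predict the next frame. This is a minimal sketch under stated assumptions, not the paper's implementation; the frame resolution, the toy `text_to_frames` and `audio_to_frames` renderers, and the `NextFramePredictor` model are all illustrative choices.

```python
# Minimal sketch of unified next-frame prediction (NOT the paper's
# implementation): each modality is rendered as fixed-size frames,
# and one causal transformer predicts the next frame. Frame size,
# renderers, and model widths are illustrative assumptions.
import torch
import torch.nn as nn

FRAME = 64  # assumed square frame resolution (grayscale, 64x64)

def text_to_frames(text: str) -> torch.Tensor:
    """Toy renderer: one frame per byte, painted as a constant
    intensity. A real system would rasterize glyphs or token grids."""
    frames = [torch.full((1, FRAME, FRAME), b / 255.0)
              for b in text.encode("utf-8")]
    return torch.stack(frames)                     # (T, 1, H, W)

def audio_to_frames(wave: torch.Tensor) -> torch.Tensor:
    """Toy renderer: slice a mono waveform's magnitude spectrogram
    into frame-sized chunks along the time axis."""
    spec = torch.stft(wave, n_fft=126, return_complex=True).abs()
    spec = (spec / (spec.max() + 1e-8)).t()        # (T', F) with F = 64
    out = []
    for chunk in spec.split(FRAME):                # up to 64 steps each
        f = torch.zeros(1, FRAME, FRAME)
        f[0, : chunk.shape[0], :] = chunk
        out.append(f)
    return torch.stack(out)                        # (T, 1, H, W)

class NextFramePredictor(nn.Module):
    """Causal transformer over flattened frames: given frames 1..t,
    predict frame t+1. Modality-agnostic by construction."""
    def __init__(self, d_model: int = 256, n_layers: int = 4):
        super().__init__()
        self.embed = nn.Linear(FRAME * FRAME, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, FRAME * FRAME)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 1, H, W) -> one token per frame
        B, T = frames.shape[:2]
        x = self.embed(frames.reshape(B, T, -1))
        # Boolean causal mask: True blocks attention to future frames.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        y = self.encoder(x, mask=mask)
        return self.head(y).reshape(B, T, 1, FRAME, FRAME)

# Usage: any task becomes "continue the frame sequence". Training
# shifts targets by one frame, as in language modeling.
model = NextFramePredictor()
prompt = text_to_frames("What is in the image?").unsqueeze(0)  # (1,T,1,H,W)
pred = model(prompt)
loss = nn.functional.mse_loss(pred[:, :-1], prompt[:, 1:])
```

In this framing, supervision is uniform across tasks: text-to-text, image-to-text, and audio-to-text all reduce to frame continuation, which is what allows a single model, with no modality-specific components, to cover all of them.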