Despite impressive advances in multimodal compositional reasoning, existing approaches remain limited in flexibility and efficiency: they process a fixed set of input modalities and update a large number of model parameters. This paper tackles these challenges and proposes CREMA, an efficient and modular modality-fusion framework for injecting any new modality into video reasoning. We first derive multiple informative modalities (such as optical flow, 3D point cloud, and audio) from the given videos by leveraging existing pre-trained models, without extra human annotation. Next, we introduce a query transformer with multiple parameter-efficient modules, one associated with each accessible modality. It projects diverse modality features into the LLM token embedding space, allowing the model to integrate different data types for response generation. Furthermore, we propose a fusion module that compresses the multimodal queries, keeping computation in the LLM efficient as additional modalities are combined. We validate our method on video-3D, video-audio, and video-language reasoning tasks and achieve performance better than or equivalent to strong multimodal LLMs, including BLIP-2, 3D-LLM, and SeViLA, while using 96% fewer trainable parameters. We provide extensive analyses of CREMA, including the impact of each modality on reasoning domains, the design of the fusion module, and example visualizations.
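The two architectural ideas in the abstract, per-modality projection into a shared embedding space and compression of the resulting queries, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the helper names (`make_adapter`, `project`, `fuse`), the dimensions, the plain linear projection standing in for a parameter-efficient module, and element-wise averaging standing in for the fusion module. The abstract does not specify how CREMA implements either step.

```python
import random


def make_adapter(in_dim, out_dim, rng):
    """Hypothetical per-modality linear map (stand-in for a parameter-efficient module)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]


def project(features, weight):
    """Map each query vector (length in_dim) to the shared embedding width (out_dim)."""
    out_dim = len(weight[0])
    return [
        [sum(vec[i] * weight[i][j] for i in range(len(vec))) for j in range(out_dim)]
        for vec in features
    ]


def fuse(per_modality_queries):
    """Compress M sets of Q query tokens into one set of Q tokens.

    Element-wise averaging is an assumption chosen for simplicity,
    not the fusion design used in the paper.
    """
    num_modalities = len(per_modality_queries)
    num_queries = len(per_modality_queries[0])
    dim = len(per_modality_queries[0][0])
    return [
        [sum(q[t][d] for q in per_modality_queries) / num_modalities for d in range(dim)]
        for t in range(num_queries)
    ]


rng = random.Random(0)
NUM_QUERIES, EMBED_DIM = 4, 8  # illustrative sizes only
feature_dims = {"video": 16, "optical_flow": 12, "audio": 10}  # hypothetical widths

# One small adapter per modality; adding a modality adds one more adapter.
adapters = {name: make_adapter(dim, EMBED_DIM, rng) for name, dim in feature_dims.items()}
features = {
    name: [[rng.random() for _ in range(dim)] for _ in range(NUM_QUERIES)]
    for name, dim in feature_dims.items()
}

projected = [project(features[name], adapters[name]) for name in feature_dims]
fused = fuse(projected)

tokens_without_fusion = len(projected) * NUM_QUERIES  # concatenating all modalities
tokens_with_fusion = len(fused)                       # after compression
print(tokens_without_fusion, tokens_with_fusion)      # prints "12 4"
```

In this sketch, concatenating the queries would make the number of tokens passed to the LLM grow linearly with the number of modalities, while the fused sequence stays at a fixed length. This is the kind of efficiency property the abstract attributes to the fusion module.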