Despite impressive recent advances, multimodal reasoning approaches remain limited in flexibility and efficiency: these models typically process only a few fixed modality inputs and require updates to large numbers of parameters. This paper tackles these challenges and proposes CREMA, a generalizable, highly efficient, and modular modality-fusion framework that can incorporate any new modality to enhance video reasoning. We first augment given videos with multiple informative modalities (such as optical flow, 3D point cloud, audio, thermal heatmap, and touch map) without extra human annotation by leveraging sensors or existing pre-trained models. Next, we introduce a query transformer with multiple parameter-efficient modules, one associated with each accessible modality. It projects diverse modality features into the LLM token-embedding space, allowing the model to integrate different data types for response generation. Furthermore, we propose a novel progressive multimodal fusion design supported by a lightweight fusion module and a modality-sequential training strategy. It compresses information across the various assisting modalities, maintaining computational efficiency in the LLM while improving performance. We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including conventional VideoQA and Video-Audio/3D/Touch/Thermal QA, and achieve better or comparable performance against strong multimodal LLMs, including OneLLM, BLIP-2, and SeViLA, while reducing trainable parameters by over 90%. We provide extensive analyses of CREMA, including the impact of each modality on reasoning domains, the design of the fusion module, and example visualizations.
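The core idea above, per-modality lightweight projections into a shared LLM token-embedding space followed by a compressing fusion step, can be illustrated with a minimal numpy sketch. All dimensions, modality names, and the chunk-mean compression used here are hypothetical stand-ins, not CREMA's actual modules or trained weights:

```python
import numpy as np

rng = np.random.default_rng(0)

D_LLM = 32    # hypothetical LLM token-embedding width
N_QUERY = 4   # query tokens kept per modality (illustrative)

# Hypothetical feature widths for three of the modalities named above.
feat_dims = {"video": 24, "audio": 16, "flow": 20}

# One lightweight projection per modality stands in for the
# parameter-efficient modality modules; only these would be trained.
proj = {m: rng.standard_normal((d, D_LLM)) * 0.02 for m, d in feat_dims.items()}

def encode_modality(name, feats):
    """Project raw modality features (T, d_m) into the LLM embedding
    space, then compress to a fixed number of tokens via chunk means."""
    tokens = feats @ proj[name]                          # (T, D_LLM)
    chunks = np.array_split(tokens, N_QUERY, axis=0)
    return np.stack([c.mean(axis=0) for c in chunks])    # (N_QUERY, D_LLM)

def fuse(modality_feats):
    """Append each modality's compressed tokens in sequence, so the LLM
    receives a short, fixed-size multimodal prefix regardless of how
    many frames or samples each modality provides."""
    return np.concatenate(
        [encode_modality(m, f) for m, f in modality_feats.items()], axis=0
    )

inputs = {m: rng.standard_normal((10, d)) for m, d in feat_dims.items()}
prefix = fuse(inputs)
print(prefix.shape)  # (12, 32): 3 modalities x 4 query tokens each
```

The key property this sketch captures is that adding a new modality only adds one small projection (here, a `d_m x D_LLM` matrix) while the prefix length grows by a fixed `N_QUERY` tokens, which is why the framework stays parameter- and compute-efficient as modalities accumulate.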