Recent large-scale models have shown remarkable generalization across a variety of tasks. However, integrating multimodal processing into these models remains a significant challenge because it often carries a high computational cost. To address this, we introduce Multimodal Infusion Tuning (MiT), a parameter-efficient multimodal tuning strategy for large models. MiT leverages decoupled self-attention mechanisms within large language models to integrate information from diverse modalities such as images and audio. We also design a novel head-level adaptive rescaling strategy that optimizes the representation of the infused multimodal features. Notably, all foundation models are kept frozen during tuning to reduce the computational burden (only 2.5\% of the parameters are tunable). We conduct experiments across a range of multimodal tasks, including image-related tasks such as referring segmentation and non-image tasks such as sentiment analysis. Our results show that MiT achieves state-of-the-art performance in multimodal understanding while significantly reducing computational overhead (to 10\% of that of previous methods). Moreover, our tuned model exhibits robust reasoning abilities even in complex scenarios.
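The abstract names two mechanisms, decoupled self-attention and head-level adaptive rescaling, without giving their equations. The sketch below shows one plausible reading, not the method as specified in the paper: each attention head runs the usual text attention and a separate attention over multimodal keys and values (each with its own softmax), then adds the multimodal result scaled by a per-head gate. The function names, the additive combination, and the scalar-gate form are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention of one query vector over key/value lists."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def infused_head_output(query, text_kv, modal_kv, gate):
    """One head: text attention plus a gated multimodal branch.

    Assumption (not taken from the paper): the two branches use separate
    softmaxes and are combined as text_out + gate * modal_out, where `gate`
    is a per-head scalar standing in for the head-level rescaling.
    """
    text_out = attend(query, text_kv[0], text_kv[1])
    modal_out = attend(query, modal_kv[0], modal_kv[1])
    return [t + gate * m for t, m in zip(text_out, modal_out)]


def infused_attention(queries, text_kvs, modal_kvs, gates):
    """Apply the gated infusion independently to every head."""
    return [
        infused_head_output(q, t, m, g)
        for q, t, m, g in zip(queries, text_kvs, modal_kvs, gates)
    ]


# Toy usage: two heads share the same inputs but have different gates.
query = [1.0, 0.0]
text_kv = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
modal_kv = ([[0.5, 0.5]], [[10.0, -10.0]])
heads = infused_attention(
    [query, query], [text_kv, text_kv], [modal_kv, modal_kv], [0.0, 0.5]
)
```

With a gate of zero, a head reproduces the text-only output exactly, so in this sketch the frozen model's behavior is the starting point and only the gates (plus whatever projects the multimodal features) would need training.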