In this work, we introduce the Context-Aware MultiModal Learner (CaMML) for tuning large multimodal models (LMMs). CaMML is a lightweight module crafted to seamlessly integrate multimodal contextual samples into large models, thereby empowering the model to derive knowledge from analogous, domain-specific, up-to-date information and make grounded inferences. Importantly, CaMML is highly scalable and, owing to its hierarchical design, can efficiently handle lengthy multimodal context examples. Based on CaMML, we have developed two multimodal models, CaMML-7B and CaMML-13B, that show exceptional performance across an array of benchmark datasets for multimodal tasks. Remarkably, CaMML-13B achieves state-of-the-art performance on over ten widely recognized multimodal benchmark datasets, surpassing LLaVA-1.5 (13B) by a noticeable margin without integrating any external resources. Moreover, we have conducted extensive ablation studies to inspect the inner workings of CaMML and performed qualitative analyses to showcase its effectiveness in handling challenging real-world cases.