Vision-language pre-training and instruction tuning have demonstrated general-purpose capabilities in 2D visual reasoning tasks by aligning visual encoders with state-of-the-art large language models (LLMs). In this paper, we introduce a simple yet effective cross-modality framework built atop frozen LLMs that allows the integration of various modalities without extensive modality-specific customization. To facilitate instruction-modality fine-tuning, we collect high-quality instruction tuning data in an automatic and scalable manner, comprising 24K QA samples for audio and 250K QA samples for 3D. Leveraging instruction-aware representations, our model performs comparably with leading-edge counterparts without the need for extensive modality-specific pre-training or customization. Furthermore, our approach demonstrates cross-modal reasoning abilities across two or more input modalities, even though each modality projection is trained individually. To study the model's cross-modal abilities, we contribute a novel Discriminative Cross-modal Reasoning (DisCRn) evaluation task, comprising 9K audio-video QA samples and 28K image-3D QA samples, which requires the model to reason discriminatively across disparate input modalities.