We present Any-Modality Augmented Language Model (AnyMAL), a unified model that reasons over diverse input modality signals (i.e., text, image, video, audio, and IMU motion sensor data) and generates textual responses. AnyMAL inherits the powerful text-based reasoning abilities of state-of-the-art LLMs, including LLaMA-2 (70B), and converts modality-specific signals into the joint textual space through a pre-trained aligner module. To further strengthen the multimodal LLM's capabilities, we fine-tune the model with a manually collected multimodal instruction set that covers diverse topics and tasks beyond simple QA. We conduct a comprehensive empirical analysis comprising both human and automatic evaluations and demonstrate state-of-the-art performance on various multimodal tasks.
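
The idea of mapping modality-specific signals into the LLM's textual space can be illustrated with a toy sketch. The abstract does not specify the aligner's architecture, so everything here is an assumption made for illustration: the single linear projection, the dimensions, and the names `make_projection`, `project`, and `build_llm_input` are hypothetical and are not taken from AnyMAL itself.

```python
import random

ENCODER_DIM = 4  # assumed size of a modality encoder's output vectors
LLM_DIM = 6      # assumed size of the LLM's token embeddings


def make_projection(in_dim, out_dim, seed=0):
    """Create a randomly initialised (out_dim x in_dim) weight matrix."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]


def project(features, weights):
    """Map each encoder vector into the LLM embedding space (matrix-vector product)."""
    return [
        [sum(w * x for w, x in zip(row, vec)) for row in weights]
        for vec in features
    ]


def build_llm_input(modality_features, text_embeddings, weights):
    """Prepend the projected modality vectors to the text token embeddings."""
    return project(modality_features, weights) + text_embeddings


# Two stand-in "image" feature vectors and three stand-in text token embeddings.
weights = make_projection(ENCODER_DIM, LLM_DIM)
image_features = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
text_embeddings = [[0.0] * LLM_DIM for _ in range(3)]

sequence = build_llm_input(image_features, text_embeddings, weights)
print(len(sequence), len(sequence[0]))
```

In this sketch the projected vectors have the same width as the text embeddings, so a language model could consume the combined sequence as if it were ordinary token input. A real aligner would be trained rather than randomly initialised.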