In this work, we present a conceptually simple yet powerful baseline for the multimodal dialog task: the S3 model, which achieves near state-of-the-art results on two challenging leaderboards, MMMU and AI Journey Contest 2023. The system combines a pre-trained large language model, pre-trained modality encoders for image and audio, and a trainable modality projector. The proposed data mixture for training this architecture demonstrates that a multimodal model built on a strong language model and trained on a small amount of multimodal data can perform effectively on the multimodal dialog task.
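The described architecture can be illustrated with a minimal sketch. This is not the paper's implementation: the dimensions, the linear form of the projector, and the token counts are all illustrative assumptions. It shows only the data flow the abstract describes, with the encoder and LLM frozen and a single trainable projector mapping modality features into the LLM's token-embedding space.

```python
import numpy as np

# Hypothetical dimensions, not taken from the paper.
ENC_DIM = 1024   # output dim of a frozen modality encoder (e.g., image)
LLM_DIM = 4096   # token-embedding dim of the frozen language model

rng = np.random.default_rng(0)

# The only trainable component: a linear projector from encoder
# feature space into the LLM embedding space.
W = rng.standard_normal((ENC_DIM, LLM_DIM)) * 0.02
b = np.zeros(LLM_DIM)

def project(features: np.ndarray) -> np.ndarray:
    """Map (n_tokens, ENC_DIM) encoder features to (n_tokens, LLM_DIM)."""
    return features @ W + b

# Stand-in for frozen-encoder output: 16 "visual tokens" for one image.
image_features = rng.standard_normal((16, ENC_DIM))
visual_tokens = project(image_features)

# Stand-in for frozen-LLM embeddings of the text in the dialog turn.
text_tokens = rng.standard_normal((8, LLM_DIM))

# The multimodal sequence fed to the LLM: projected modality tokens
# prepended to the text-token embeddings.
sequence = np.concatenate([visual_tokens, text_tokens], axis=0)
print(sequence.shape)  # (24, 4096)
```

Because only `W` and `b` receive gradients, training touches a tiny fraction of the total parameters, which is what makes a small multimodal data mixture sufficient.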