Large language models exhibit enhanced zero-shot performance on various tasks when fine-tuned with instruction-following data. Multimodal instruction-following models extend these capabilities by integrating both text and images. However, existing models such as MiniGPT-4 face challenges in maintaining dialogue coherence in scenarios involving multiple images. A primary reason is the lack of a specialized dataset for this critical application. To bridge this gap, we present SparklesChat, a multimodal instruction-following model for open-ended dialogues across multiple images. To support the training, we introduce SparklesDialogue, the first machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions. Furthermore, we construct SparklesEval, a GPT-assisted benchmark for quantitatively assessing a model's conversational competence across multiple images and dialogue turns. Our experiments validate the effectiveness of SparklesChat in understanding and reasoning across multiple images and dialogue turns. Specifically, SparklesChat outperforms MiniGPT-4 on established vision-and-language benchmarks, including the BISON binary image selection task and the NLVR2 visual reasoning task. Moreover, SparklesChat scores 8.56 out of 10 on SparklesEval, substantially exceeding MiniGPT-4's score of 3.91 and nearing GPT-4's score of 9.26. Qualitative evaluations further demonstrate SparklesChat's generality in handling real-world applications. All resources will be available at https://github.com/HYPJUDY/Sparkles.