Large language models exhibit enhanced zero-shot performance on various tasks when fine-tuned with instruction-following data. Multimodal instruction-following models extend these capabilities by integrating both text and images. However, existing models such as MiniGPT-4 face challenges in maintaining dialogue coherence in scenarios involving multiple images. A primary reason is the lack of a specialized dataset for this critical application. To bridge this gap, we present SparklesChat, a multimodal instruction-following model for open-ended dialogues across multiple images. To support training, we introduce SparklesDialogue, the first machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions. Furthermore, we construct SparklesEval, a GPT-assisted benchmark for quantitatively assessing a model's conversational competence across multiple images and dialogue turns. Our experiments validate the effectiveness of SparklesChat in understanding and reasoning across multiple images and dialogue turns. Specifically, SparklesChat outperforms MiniGPT-4 on established vision-and-language benchmarks, including the BISON binary image selection task and the NLVR2 visual reasoning task. Moreover, SparklesChat scores 8.56 out of 10 on SparklesEval, substantially exceeding MiniGPT-4's score of 3.91 and nearing GPT-4's score of 9.26. Qualitative evaluations further demonstrate SparklesChat's generality in handling real-world applications. All resources are available at https://github.com/HYPJUDY/Sparkles.