Large language models exhibit enhanced zero-shot performance on various tasks when fine-tuned with instruction-following data. Multimodal instruction-following models extend these capabilities by integrating both text and images. However, existing models such as MiniGPT-4 and LLaVA struggle to maintain dialogue coherence in scenarios involving multiple images, largely because no specialized dataset exists for this critical application. To bridge this gap, we introduce SparklesDialogue, the first machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions. Furthermore, we construct SparklesEval, a GPT-assisted benchmark for quantitatively assessing a model's conversational competence across multiple images and dialogue turns. We then present SparklesChat, a multimodal instruction-following model for open-ended dialogues across multiple images. Our experiments validate the effectiveness of training SparklesChat with SparklesDialogue on top of MiniGPT-4 and LLaVA-v1.5: the resulting models show enhanced comprehension across multiple images and dialogue turns without compromising single-image understanding. Qualitative evaluations further demonstrate SparklesChat's generality in handling real-world applications. All resources related to this study are publicly available at https://github.com/HYPJUDY/Sparkles.