Most existing multi-modal models struggle to handle interleaved image-and-text inputs in multi-image, multi-round dialogues. They also face substantial constraints in training resources and data accessibility, which limits their adaptability and scalability across varied interaction settings. To address this, we present the DeepSpeed-VisualChat framework, designed to extend Large Language Models (LLMs) with multi-modal capabilities, with a focus on improving how Large Vision and Language Models handle interleaved inputs. Our framework is notable for (1) open-source support for multi-round and multi-image dialogues, (2) a novel multi-modal causal attention mechanism, and (3) data blending techniques applied to existing datasets to ensure seamless interactions in multi-round, multi-image conversations. Compared to existing frameworks, DeepSpeed-VisualChat shows superior scalability, up to a 70B-parameter language model, representing a significant advancement in multi-modal language models and setting a solid foundation for future exploration.