With the recent significant advancements in large multi-modal models (LMMs), the importance of their grounding capability in visual chat is increasingly recognized. Despite recent efforts to enable LMMs to support grounding, their capabilities for grounding and chat are usually separate, and their chat performance drops dramatically when they are asked to ground. The underlying problem is the lack of a dataset for grounded visual chat (GVC): existing grounding datasets contain only short captions. To address this issue, we have created GVC data that allows grounding and chat capabilities to be combined. To better evaluate GVC capabilities, we have introduced a benchmark called Grounding-Bench. Additionally, we have proposed a model design that supports GVC and various types of visual prompts by connecting segmentation models with language models. Experimental results demonstrate that our model outperforms other LMMs on Grounding-Bench. Furthermore, our model achieves competitive performance on classic grounding benchmarks such as RefCOCO/+/g and Flickr30K Entities. Our code will be released at https://github.com/UX-Decoder/LLaVA-Grounding.