Human-AI interactivity is a critical aspect of the usability of multimodal large language models (MLLMs). However, existing end-to-end MLLMs only allow users to interact with them through language instructions, which limits interaction accuracy and efficiency. In this study, we present precise referring instructions, which use diverse reference representations such as points and boxes as referring prompts to indicate a specific region. This enables MLLMs to focus on the region of interest and achieve finer-grained interaction. Based on precise referring instructions, we propose ChatSpot, a unified end-to-end multimodal large language model that supports diverse forms of interactivity, including mouse clicks, drag-and-drop, and drawing boxes, providing a more flexible and seamless interactive experience. We also construct a multi-grained vision-language instruction-following dataset based on existing datasets and GPT-4-generated data. Furthermore, we design a series of evaluation tasks to assess the effectiveness of region recognition and interaction. Experimental results showcase ChatSpot's promising performance.