We present an interactive visual framework named InternChat, or iChat for short. The framework integrates chatbots that have planning and reasoning capabilities, such as ChatGPT, with non-verbal instructions such as pointing movements, which let users directly manipulate images or videos on the screen. Pointing movements (including gestures, cursors, etc.) provide more flexibility and precision in vision-centric tasks that require fine-grained control, editing, and generation of visual content. The name InternChat stands for interaction, nonverbal, and chatbots. Unlike existing interactive systems that rely purely on language, iChat incorporates pointing instructions, which significantly improves both the efficiency of communication between users and chatbots and the accuracy of chatbots on vision-centric tasks, especially in complicated visual scenes containing more than two objects. Additionally, iChat uses an auxiliary control mechanism to improve the control capability of the LLM, and a large vision-language model termed Husky is fine-tuned for high-quality multi-modal dialogue (rated by ChatGPT-3.5-turbo at 93.89% of GPT-4 quality). We hope this work can spark new ideas and directions for future interactive visual systems. The code is available at https://github.com/OpenGVLab/InternChat.