We present an interactive visual framework named InternChat, or iChat for short. The framework integrates chatbots that have planning and reasoning capabilities, such as ChatGPT, with non-verbal instructions such as pointing movements, which enable users to directly manipulate images or videos on the screen. Pointing movements (including gestures, cursors, etc.) can provide more flexibility and precision in vision-centric tasks that require fine-grained control, editing, and generation of visual content. The name InternChat stands for interaction, nonverbal, and chatbots. Unlike existing interactive systems that rely purely on language, iChat incorporates pointing instructions, which significantly improves the efficiency of communication between users and chatbots as well as the accuracy of chatbots on vision-centric tasks, especially in complicated visual scenarios where the number of objects is greater than two. Additionally, iChat uses an auxiliary control mechanism to improve the control capability of the LLM, and a large vision-language model termed Husky is fine-tuned for high-quality multi-modal dialogue (achieving 93.89% GPT-4 Quality relative to ChatGPT-3.5-turbo). We hope this work can spark new ideas and directions for future interactive visual systems. The code is available at https://github.com/OpenGVLab/InternChat.