We present an interactive visual framework named InternGPT, or iGPT for short. The framework integrates chatbots that have planning and reasoning capabilities, such as ChatGPT, with non-verbal instructions such as pointing movements, which let users directly manipulate images or videos on the screen. Pointing movements (including gestures, cursors, etc.) provide more flexibility and precision in vision-centric tasks that require fine-grained control, editing, and generation of visual content. The name InternGPT stands for \textbf{inter}action, \textbf{n}onverbal, and \textbf{chat}bots. Unlike existing interactive systems that rely on pure language, iGPT incorporates pointing instructions, which significantly improves both the efficiency of communication between users and chatbots and the accuracy of chatbots in vision-centric tasks, especially in complicated visual scenarios where the number of objects is greater than two. Additionally, iGPT uses an auxiliary control mechanism to improve the control capability of the LLM, and a large vision-language model termed Husky is fine-tuned for high-quality multi-modal dialogue (impressing ChatGPT-3.5-turbo with 93.89\% GPT-4 quality). We hope this work can spark new ideas and directions for future interactive visual systems. The code is available at https://github.com/OpenGVLab/InternGPT.