Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

ChatGPT is attracting cross-field interest because it provides a language interface with remarkable conversational competency and reasoning capabilities across many domains. However, since ChatGPT is trained on language, it currently cannot process or generate images from the visual world. At the same time, Visual Foundation Models such as Visual Transformers or Stable Diffusion show great visual understanding and generation capabilities, but they are experts only on specific tasks with one-round fixed inputs and outputs. To this end, we build a system called \textbf{Visual ChatGPT}, which incorporates different Visual Foundation Models and enables the user to interact with ChatGPT by 1) sending and receiving not only language but also images; 2) providing complex visual questions or visual editing instructions that require multiple AI models to collaborate over multiple steps; and 3) providing feedback and asking for corrected results. We design a series of prompts to inject the visual model information into ChatGPT, covering models with multiple inputs/outputs and models that require visual feedback. Experiments show that Visual ChatGPT opens the door to investigating the visual roles of ChatGPT with the help of Visual Foundation Models. Our system is publicly available at \url{https://github.com/microsoft/visual-chatgpt}.
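The idea of injecting visual model information into a language model through prompts, and of chaining several models over multiple steps, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the Visual ChatGPT code base: the `VisualTool` class, the `build_system_prompt` and `run_plan` helpers, and the two stand-in tools are hypothetical, and the tools only rewrite strings instead of calling real models. In an actual system the plan would presumably be chosen by the language model rather than hard-coded.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class VisualTool:
    """Hypothetical wrapper around one visual foundation model."""
    name: str
    description: str            # text injected into the language model's prompt
    run: Callable[[str], str]   # maps an input string (e.g. an image path) to an output string


def build_system_prompt(tools: List[VisualTool]) -> str:
    """List every registered tool so a language model could choose among them."""
    lines = ["You can use the following visual tools:"]
    for tool in tools:
        lines.append(f"- {tool.name}: {tool.description}")
    return "\n".join(lines)


def run_plan(plan: List[str], tools: Dict[str, VisualTool], first_input: str) -> str:
    """Run the named tools in order, feeding each output into the next tool."""
    value = first_input
    for name in plan:
        value = tools[name].run(value)
    return value


# Stand-in tools (assumptions): they only manipulate strings, no model is called.
tools = [
    VisualTool("caption", "Describe an image given its path.",
               lambda path: f"a caption of {path}"),
    VisualTool("edit", "Edit an image given its path; returns a new path.",
               lambda path: path.replace(".png", "_edited.png")),
]
registry = {tool.name: tool for tool in tools}

prompt = build_system_prompt(tools)
# Hard-coded two-step plan: edit the image first, then caption the edited result.
result = run_plan(["edit", "caption"], registry, "cat.png")
print(prompt)
print(result)
```

Passing images between steps as file paths keeps every intermediate result expressible as text, which is one plausible way for a text-only model to refer to images it cannot read directly.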

Related content

ChatGPT

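ChatGPT (full name: Chat Generative Pre-trained Transformer) is a chatbot program developed by the US company OpenAI [1] and released on November 30, 2022. ChatGPT is an AI-driven natural language processing tool that holds conversations by learning and understanding human language and that responds according to the context of the chat, communicating much as a person would. It can also complete tasks such as writing emails, video scripts, marketing copy, translations, code, and papers. [1] https://openai.com/blog/chatgpt/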