ChatGPT is attracting cross-field interest because it provides a language interface with remarkable conversational competency and reasoning capabilities across many domains. However, since ChatGPT is trained on language, it currently cannot process or generate images from the visual world. At the same time, Visual Foundation Models, such as Visual Transformers or Stable Diffusion, show great visual understanding and generation capabilities, but they are experts only on specific tasks with one-round fixed inputs and outputs. To this end, we build a system called \textbf{Visual ChatGPT}, incorporating different Visual Foundation Models, to enable the user to interact with ChatGPT by 1) sending and receiving not only language but also images; 2) providing complex visual questions or visual editing instructions that require the collaboration of multiple AI models over multiple steps; and 3) providing feedback and asking for corrected results. We design a series of prompts to inject the visual model information into ChatGPT, taking into account models with multiple inputs/outputs and models that require visual feedback. Experiments show that Visual ChatGPT opens the door to investigating the visual roles of ChatGPT with the help of Visual Foundation Models. Our system is publicly available at \url{https://github.com/microsoft/visual-chatgpt}.