We propose MM-REACT, a system paradigm that integrates ChatGPT with a pool of vision experts to achieve multimodal reasoning and action. In this paper, we define and explore a comprehensive list of advanced vision tasks that are intriguing to solve but may exceed the capabilities of existing vision and vision-language models. To achieve such advanced visual intelligence, MM-REACT introduces a textual prompt design that can represent text descriptions, textualized spatial coordinates, and aligned file names for dense visual signals such as images and videos. This prompt design allows language models to accept, associate, and process multimodal information, thereby facilitating the synergetic combination of ChatGPT and various vision experts. Zero-shot experiments demonstrate MM-REACT's effectiveness in addressing the specified capabilities of interest and its broad applicability to different scenarios that require advanced visual understanding. Furthermore, we discuss and compare MM-REACT's system paradigm with an alternative approach that extends language models for multimodal scenarios through joint finetuning. Code, demo, video, and visualization are available at https://multimodal-react.github.io/
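To make the prompt design concrete, the following is a minimal sketch of how an image's file name, a text description, and textualized spatial coordinates might be serialized into a single text prompt for a language model. It is an illustration under stated assumptions, not the MM-REACT implementation: the `Detection` class, the `textualize` and `build_prompt` functions, the `(left, top, right, bottom)` box convention, and the prompt layout are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """One hypothetical vision-expert result: a label plus a pixel bounding box."""
    label: str
    box: Tuple[int, int, int, int]  # (left, top, right, bottom); an assumed convention


def textualize(file_name: str, caption: str, detections: List[Detection]) -> str:
    """Serialize a file name, a caption, and detected boxes as plain text."""
    lines = [f"Image file: {file_name}", f"Caption: {caption}"]
    if detections:
        lines.append("Objects (left, top, right, bottom):")
        for det in detections:
            coords = ", ".join(str(v) for v in det.box)
            lines.append(f"- {det.label} at ({coords})")
    return "\n".join(lines)


def build_prompt(question: str, observation: str) -> str:
    """Combine the textualized observation with a user question (assumed layout)."""
    return f"{observation}\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    observation = textualize(
        "photo.jpg",
        "two dogs on grass",
        [Detection("dog", (10, 20, 110, 220)), Detection("dog", (150, 30, 260, 230))],
    )
    print(build_prompt("How many dogs are in the image?", observation))
```

Because every visual signal is reduced to text, the language model can refer back to an image by its file name and to a region by its coordinates without consuming pixels directly, which is the property the prompt design relies on.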