Agent AI: Surveying the Horizons of Multimodal Interaction

Zane Durante, Qiuyuan Huang, Naoki Wake, Ran Gong, Jae Sung Park, Bidipta Sarkar, Rohan Taori, Yusuke Noda, Demetri Terzopoulos, Yejin Choi, Katsushi Ikeuchi, Hoi Vo, Li Fei-Fei, Jianfeng Gao

Multimodal AI systems will likely become a ubiquitous presence in our everyday lives. A promising approach to making these systems more interactive is to embody them as agents within physical and virtual environments. At present, systems leverage existing foundation models as the basic building blocks for creating embodied agents. Embedding agents within such environments helps models process and interpret visual and contextual data, which is critical for building more sophisticated, context-aware AI systems. For example, a system that can perceive user actions, human behavior, environmental objects, audio expressions, and the collective sentiment of a scene can use those signals to inform and direct agent responses within the given environment. To accelerate research on agent-based multimodal intelligence, we define "Agent AI" as a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally grounded data, and can produce meaningful embodied actions. In particular, we explore systems that aim to improve agents based on next-embodied-action prediction by incorporating external knowledge, multi-sensory inputs, and human feedback. We argue that developing agentic AI systems in grounded environments can also mitigate the hallucinations of large foundation models and their tendency to generate environmentally incorrect outputs. The emerging field of Agent AI subsumes the broader embodied and agentic aspects of multimodal interactions. Beyond agents acting and interacting in the physical world, we envision a future where people can easily create any virtual reality or simulated scene and interact with agents embodied within the virtual environment.
