Foundation models pretrained on diverse data at scale have demonstrated extraordinary capabilities in a wide range of vision and language tasks. When such models are deployed in real world environments, they inevitably interface with other entities and agents. For example, language models are often used to interact with human beings through dialogue, and visual perception models are used to autonomously navigate neighborhood streets. In response to these developments, new paradigms are emerging for training foundation models to interact with other agents and perform long-term reasoning. These paradigms leverage the existence of ever-larger datasets curated for multimodal, multitask, and generalist interaction. Research at the intersection of foundation models and decision making holds tremendous promise for creating powerful new systems that can interact effectively across a diverse range of applications such as dialogue, autonomous driving, healthcare, education, and robotics. In this manuscript, we examine the scope of foundation models for decision making, and provide conceptual tools and technical background for understanding the problem space and exploring new research directions. We review recent approaches that ground foundation models in practical decision making applications through a variety of methods such as prompting, conditional generative modeling, planning, optimal control, and reinforcement learning, and discuss common challenges and open problems in the field.
翻译:在多样化大规模数据上预训练的基础模型已展现出在广泛视觉与语言任务中的非凡能力。当此类模型部署于真实环境时,它们不可避免地需要与其他实体和智能体交互。例如,语言模型常通过对话与人类交互,视觉感知模型则用于自主导航街区。针对这些发展,训练基础模型以与其他智能体交互并执行长期推理的新范式正在兴起。这些范式利用日益庞大的多模态、多任务及通用交互数据集。基础模型与决策领域的交叉研究有望催生强大新系统,使其能在对话、自动驾驶、医疗、教育及机器人等多样化应用中有效交互。本文审视了面向决策的基础模型范畴,提供概念工具与技术背景以理解问题空间并探索新研究方向。我们回顾了近期通过提示、条件生成建模、规划、最优控制及强化学习等方法将基础模型落地于实际决策应用的研究,并探讨了该领域的共性挑战与开放性问题。