Multimodal Pretrained Models for Sequential Decision-Making: Synthesis, Verification, Grounding, and Perception

Recently developed pretrained models can encode rich world knowledge expressed in multiple modalities, such as text and images. However, the outputs of these models cannot be directly integrated into algorithms that solve sequential decision-making tasks. We develop an algorithm that uses the knowledge from pretrained models to construct and verify controllers for sequential decision-making tasks, and to ground these controllers to task environments through visual observations. In particular, the algorithm queries a pretrained model with a user-provided, text-based task description and uses the model's output to construct an automaton-based controller that encodes the model's task-relevant knowledge. It then verifies whether the knowledge encoded in the controller is consistent with other independently available knowledge, which may include abstract information on the environment or user-provided specifications. If this verification step discovers an inconsistency, the algorithm automatically refines the controller to resolve it. Next, the algorithm leverages the vision and language capabilities of pretrained models to ground the controller to the task environment. It collects image-based observations from the task environment and uses the pretrained model to link these observations to the text-based control logic encoded in the controller (e.g., actions and the conditions that trigger them). We also propose a mechanism to ensure that the controller satisfies the user-provided specification even in the presence of perceptual uncertainty. We demonstrate the algorithm's ability to construct, verify, and ground automaton-based controllers on a suite of real-world tasks, including daily-life and robot manipulation tasks.
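
The construct, verify, and refine loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: the road-crossing task, the propositions `green_light` and `car`, the `Transition` class, and the helpers `verify` and `refine` are all hypothetical names chosen for illustration. The sketch also assumes a deliberately coarse environment abstraction (every combination of propositions may be observed) and replaces formal model checking with an exhaustive check of a simple safety condition.

```python
from dataclasses import dataclass
from itertools import combinations

PROPS = ("green_light", "car")  # hypothetical atomic propositions


@dataclass(frozen=True)
class Transition:
    source: str
    requires: frozenset  # propositions that must be observed
    forbids: frozenset   # propositions that must not be observed
    action: str
    target: str

    def enabled(self, obs):
        return self.requires <= obs and not (self.forbids & obs)


def all_observations(props):
    """Every subset of props: an assumed, very coarse environment abstraction."""
    return [frozenset(c) for r in range(len(props) + 1)
            for c in combinations(props, r)]


def step(transitions, state, obs, default_action="stay"):
    """Return (action, next_state); fall back to a no-op if nothing is enabled."""
    for t in transitions:
        if t.source == state and t.enabled(obs):
            return t.action, t.target
    return default_action, state


def verify(transitions, initial, unsafe):
    """Explore reachable controller states and collect (state, obs, action)
    triples for which `unsafe(obs, action)` holds."""
    violations, seen, frontier = [], {initial}, [initial]
    while frontier:
        state = frontier.pop()
        for obs in all_observations(PROPS):
            action, nxt = step(transitions, state, obs)
            if unsafe(obs, action):
                violations.append((state, obs, action))
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return violations


def refine(transitions, violations, blocking_prop):
    """Strengthen the guard of each violating transition with `blocking_prop`."""
    bad = {(s, a) for s, _, a in violations}
    return [
        Transition(t.source, t.requires, t.forbids | {blocking_prop},
                   t.action, t.target)
        if (t.source, t.action) in bad else t
        for t in transitions
    ]


# Controller as it might be built from a model's answer to "cross the road":
# "wait until the light is green, then cross" (no mention of cars).
controller = [Transition("wait", frozenset({"green_light"}), frozenset(),
                         "cross", "done")]


def unsafe(obs, action):
    """Assumed user specification: never cross while a car is present."""
    return action == "cross" and "car" in obs


violations = verify(controller, "wait", unsafe)
refined = refine(controller, violations, "car")
remaining = verify(refined, "wait", unsafe)
```

In this toy setting, the check on the initial controller reports the case in which the light is green while a car is present, and the refinement step strengthens the guard on the offending transition so that a second check finds no violations. Grounding would then amount to computing the observed set of propositions from images with a vision-language model, which this sketch leaves out.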
