Embodied agents have achieved strong performance in following human instructions to complete tasks. However, the potential of providing instructions informed by both text and images to assist humans in completing tasks remains underexplored. To uncover this capability, we present the multimodal procedural planning (MPP) task, in which models are given a high-level goal and generate plans of paired text-image steps, providing more complementary and informative guidance than unimodal plans. The key challenges of MPP are to ensure the informativeness, temporal coherence, and accuracy of plans across modalities. To tackle these challenges, we propose Text-Image Prompting (TIP), a dual-modality prompting method that jointly leverages the zero-shot reasoning ability of large language models (LLMs) and the compelling text-to-image generation ability of diffusion-based models. TIP improves the interaction between the two modalities using a Text-to-Image Bridge and an Image-to-Text Bridge: the former allows LLMs to guide text-grounded image plan generation, and the latter leverages descriptions of the image plans to ground the textual plan in return. To address the lack of relevant datasets, we collect WIKIPLAN and RECIPEPLAN as a testbed for MPP. Our results show compelling human preferences and automatic scores against unimodal and multimodal baselines on WIKIPLAN and RECIPEPLAN in terms of informativeness, temporal coherence, and plan accuracy. Our code and data are available at https://github.com/YujieLu10/MPP.
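To make the dual-bridge idea concrete, the following is a minimal sketch of how a goal could be turned into paired text-image steps. It is an illustration built only from the description above, not the implementation in the linked repository: the function `multimodal_plan`, the `Step` type, the prompt wording, and the three callables (`llm`, `t2i`, `caption`) are all hypothetical placeholders for an LLM, a diffusion-based text-to-image model, and an image captioner.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-ins for real models; none of these names come from the paper.
LLM = Callable[[str], str]            # prompt -> completion
TextToImage = Callable[[str], bytes]  # image prompt -> image data
Captioner = Callable[[bytes], str]    # image data -> caption


@dataclass
class Step:
    text: str     # textual step, revised after seeing the image description
    image: bytes  # image generated for this step


def multimodal_plan(goal: str, llm: LLM, t2i: TextToImage,
                    caption: Captioner) -> List[Step]:
    # 1. Zero-shot textual plan: one step per non-empty line of the completion.
    completion = llm(f"List the steps to: {goal}")
    draft = [line.strip() for line in completion.splitlines() if line.strip()]

    steps: List[Step] = []
    for i, step_text in enumerate(draft, start=1):
        # 2. Text-to-Image Bridge (assumed form): the LLM turns the step
        #    into a visual prompt that conditions the image generator.
        image_prompt = llm(
            f"Goal: {goal}\nStep {i}: {step_text}\n"
            "Describe the image for this step:"
        )
        image = t2i(image_prompt)

        # 3. Image-to-Text Bridge (assumed form): a caption of the image
        #    is fed back so the LLM can revise the textual step.
        description = caption(image)
        revised = llm(
            f"Goal: {goal}\nStep {i}: {step_text}\n"
            f"Image shows: {description}\nRewrite the step to match:"
        )
        steps.append(Step(text=revised.strip(), image=image))
    return steps
```

Passing the models in as plain callables keeps the sketch independent of any particular LLM or diffusion library; in practice each callable would wrap whichever model API is actually used.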