Large Language Model (LLM)-based agents have shown promise in procedural tasks, but the potential of multimodal instructions that combine text and video to assist users remains under-explored. To address this gap, we propose Visually Grounded Text-Video Prompting (VG-TVP), a novel LLM-empowered Multimodal Procedural Planning (MPP) framework. Given a high-level objective, it generates cohesive text and video procedural plans. The main challenges are achieving textual and visual informativeness, temporal coherence, and accuracy in procedural plans. VG-TVP leverages the zero-shot reasoning capability of LLMs, the video-to-text generation ability of video captioning models, and the text-to-video generation ability of diffusion models. VG-TVP strengthens the interaction between modalities through a novel Fusion of Captioning (FoC) method together with a Text-to-Video Bridge (T2V-B) and a Video-to-Text Bridge (V2T-B), which allow LLMs to guide the generation of visually grounded text plans and textually grounded video plans. To address the scarcity of datasets suitable for MPP, we have curated a new dataset called Daily-Life Task Procedural Plans (Daily-PP). We conduct comprehensive experiments benchmarking human preferences regarding textual and visual informativeness, temporal coherence, and plan accuracy. Our VG-TVP method outperforms unimodal baselines on the Daily-PP dataset.
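To make the described flow concrete, below is a minimal Python sketch of the pipeline as the abstract outlines it: an LLM drafts a text plan, each step is grounded in a generated video clip (T2V-B), the clip is captioned back to text (V2T-B), and the captions are fused with the step so the LLM can revise it (FoC). Every callable here (llm, text_to_video, caption_video) and every prompt string is a hypothetical stand-in under our own assumptions, not the authors' actual components.

```python
from typing import Callable, List, Tuple

def vg_tvp(
    goal: str,
    llm: Callable[[str], str],                    # zero-shot LLM (plan drafting/revision)
    text_to_video: Callable[[str], bytes],        # T2V-B: per-step diffusion video generation
    caption_video: Callable[[bytes], List[str]],  # V2T-B: video captioning model
) -> Tuple[List[str], List[bytes]]:
    # 1. Draft a step-by-step text plan with the LLM's zero-shot reasoning.
    draft = llm(f"List the steps to accomplish: {goal}")
    steps = [s.strip() for s in draft.splitlines() if s.strip()]

    text_plan, video_plan = [], []
    for step in steps:
        # 2. T2V-B: ground the textual step in a generated video clip.
        clip = text_to_video(step)
        # 3. V2T-B: caption the clip to recover what is actually depicted.
        captions = caption_video(clip)
        # 4. FoC: fuse step and captions, then let the LLM revise the step
        #    so the textual plan stays consistent with the visual content.
        fused = llm(
            f"Step: {step}\nVideo captions: {'; '.join(captions)}\n"
            "Revise the step so it matches the visual content."
        )
        text_plan.append(fused)
        video_plan.append(clip)
    return text_plan, video_plan

if __name__ == "__main__":
    # Smoke test with dummy stand-ins; real models would replace these.
    plan, clips = vg_tvp(
        "make a cup of tea",
        llm=lambda p: ("boil water\nsteep tea bag" if "List" in p
                       else p.splitlines()[0].removeprefix("Step: ")),
        text_to_video=lambda s: s.encode(),
        caption_video=lambda v: [v.decode()],
    )
    print(plan)
```

The round trip in steps 2-4 is what makes the text plan "visually grounded": the LLM only keeps a step after checking it against captions of the video actually generated for it.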