In this paper, we explore the capability of an agent to construct a logical sequence of action steps, thereby assembling a strategic procedural plan. This plan is crucial for navigating from an initial visual observation to a target visual outcome, as depicted in real-life instructional videos. Existing works have attained partial success by extensively leveraging various sources of information available in the datasets, such as heavy intermediate visual observations, procedure names, or natural-language step-by-step instructions, as features or supervision signals. However, the task remains formidable due to the implicit causal constraints in the sequencing of steps and the variability inherent in multiple feasible plans. To tackle these intricacies that previous efforts have overlooked, we propose to enhance the capabilities of the agent by infusing it with procedural knowledge. This knowledge, sourced from training procedure plans and structured as a directed weighted graph, equips the agent to better navigate the complexities of step sequencing and its potential variations. We name our approach KEPP, a novel Knowledge-Enhanced Procedure Planning system, which harnesses a probabilistic procedural knowledge graph extracted from training data, effectively acting as a comprehensive textbook for the training domain. Experimental evaluations across three widely used datasets under settings of varying complexity show that KEPP attains state-of-the-art results while requiring only minimal supervision.
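To make the idea of a directed weighted graph built from training procedure plans concrete, the following minimal sketch counts step-to-step transitions in a few plans and normalizes each node's outgoing counts into probabilities. The abstract does not specify how edges are weighted, so the frequency-based normalization is an assumption, and the function name `build_procedural_graph` and the example plans are hypothetical.

```python
from collections import defaultdict


def build_procedural_graph(plans):
    """Build a directed weighted graph from step sequences.

    Assumption: an edge weight is the empirical probability of moving
    from one step to the next, estimated from transition counts.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for plan in plans:
        # Each pair of consecutive steps contributes one directed edge.
        for src, dst in zip(plan, plan[1:]):
            counts[src][dst] += 1

    graph = {}
    for src, outgoing in counts.items():
        total = sum(outgoing.values())
        # Normalize so the outgoing weights of each node sum to 1.
        graph[src] = {dst: n / total for dst, n in outgoing.items()}
    return graph


# Hypothetical training plans (illustrative only, not taken from any dataset).
training_plans = [
    ["pour water", "add coffee", "stir"],
    ["pour water", "add sugar", "stir"],
    ["pour water", "add coffee", "add sugar", "stir"],
]

graph = build_procedural_graph(training_plans)
for step, successors in graph.items():
    print(step, "->", successors)
```

In this toy graph, a step with several successors reflects the variability among feasible plans, while the edge direction captures the ordering constraints between steps.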