Procedure planning in instructional videos entails generating a sequence of action steps from visual observations of the initial and target states. Despite rapid progress on this task, several critical challenges remain: (1) Adaptive procedures: Prior works make the unrealistic assumption that the number of action steps is known and fixed, which yields models that do not generalize to real-world scenarios where the sequence length varies. (2) Temporal relations: Knowledge of the temporal relations between steps is essential for producing reasonable and executable plans. (3) Annotation cost: Annotating instructional videos with step-level labels (i.e., timestamps) or sequence-level labels (i.e., action categories) is demanding and labor-intensive, which limits scalability to large datasets. In this work, we propose a new and practical setting, called adaptive procedure planning in instructional videos, in which the procedure length is neither fixed nor pre-determined. To address these challenges, we introduce the Retrieval-Augmented Planner (RAP) model. Specifically, for adaptive procedures, RAP uses an auto-regressive model architecture to decide when a plan should end. For temporal relations, RAP establishes an external memory module that explicitly retrieves the most relevant state-action pairs from the training videos and uses them to revise the generated procedures. To reduce the annotation cost, RAP adopts weakly supervised learning and expands the training dataset to other task-relevant, unannotated videos by generating pseudo labels for their action steps. Experiments on the CrossTask and COIN benchmarks show that RAP outperforms traditional fixed-length models, establishing it as a strong baseline for adaptive procedure planning.
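The two mechanisms named above, variable-length decoding that stops on its own and retrieval of state-action pairs from a memory, can be illustrated with a toy sketch. This is not the RAP architecture: there is no learned model here, and the `END` token, the `(state, action, next_state)` memory layout, the cosine-similarity lookup, and the `goal_threshold` parameter are all assumptions made for illustration.

```python
import math

END = "<end>"  # hypothetical end-of-plan token (an assumption, not from the paper)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(memory, query, k=1):
    """Return the k memory entries whose stored state is most similar to query."""
    return sorted(memory, key=lambda e: cosine(e[0], query), reverse=True)[:k]


def next_action(memory, state, goal, goal_threshold=0.99):
    """Stand-in for a learned step predictor.

    Emits END once the current state is close enough to the goal; otherwise
    copies the action of the nearest stored state from the memory.
    """
    if cosine(state, goal) >= goal_threshold:
        return END, state
    _, action, next_state = retrieve(memory, state)[0]
    return action, next_state


def plan(memory, start, goal, max_steps=10):
    """Decode one step at a time until END is emitted or max_steps is reached."""
    state, steps = start, []
    for _ in range(max_steps):
        action, state = next_action(memory, state, goal)
        if action == END:
            break
        steps.append(action)
    return steps


# Toy memory of (state, action, next_state) triples with made-up 3-d "features".
DEMO_MEMORY = [
    ((1.0, 0.0, 0.0), "boil water", (0.0, 1.0, 0.0)),
    ((0.0, 1.0, 0.0), "add tea bag", (0.0, 0.0, 1.0)),
]

if __name__ == "__main__":
    print(plan(DEMO_MEMORY, (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)))
    print(plan(DEMO_MEMORY, (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)))
```

The plan length is not passed in anywhere: the same `plan` call returns a different number of steps depending on how far the start state is from the goal, and `max_steps` only acts as a safety cap.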