RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos

Procedure planning in instructional videos entails generating a sequence of action steps from visual observations of the initial and target states. Despite rapid progress on this task, several critical challenges remain: (1) Adaptive procedures: prior works make the unrealistic assumption that the number of action steps is known and fixed, which yields models that do not generalize to real-world scenarios where the sequence length varies. (2) Temporal relation: understanding the temporal relations between steps is essential for producing reasonable and executable plans. (3) Annotation cost: annotating instructional videos with step-level labels (i.e., timestamps) or sequence-level labels (i.e., action categories) is demanding and labor-intensive, which limits scalability to large-scale datasets. In this work, we propose a new and practical setting, called adaptive procedure planning in instructional videos, where the procedure length is neither fixed nor pre-determined. To address these challenges, we introduce the Retrieval-Augmented Planner (RAP) model. Specifically, for adaptive procedures, RAP uses an auto-regressive model architecture to decide adaptively when the action sequence ends. For temporal relation, RAP establishes an external memory module that explicitly retrieves the most relevant state-action pairs from the training videos and uses them to revise the generated procedures. To tackle the high annotation cost, RAP uses weakly supervised learning to expand the training dataset to other task-relevant, unannotated videos by generating pseudo labels for action steps. Experiments on the CrossTask and COIN benchmarks show the superiority of RAP over traditional fixed-length models, establishing it as a strong baseline solution for adaptive procedure planning.
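The two mechanisms named in the abstract, auto-regressive decoding that stops on its own and revision by retrieved state-action pairs, can be illustrated with a toy sketch. This is not the authors' implementation: the names `plan`, `retrieve`, `toy_step`, and `MEMORY`, the cosine-similarity retrieval, and the revision rule (keep a proposed action only if it appears among the nearest stored neighbours) are all assumptions made for illustration, and the states are hand-made 2-D vectors rather than video features.

```python
import math

END = "<end>"  # hypothetical end-of-procedure token

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(memory, state, k=1):
    """Return the actions of the k stored states most similar to `state`."""
    ranked = sorted(memory, key=lambda pair: cosine(pair[0], state), reverse=True)
    return [action for _, action in ranked[:k]]

def plan(start, goal, step_fn, memory, max_steps=10):
    """Decode actions one at a time until step_fn emits END (adaptive length)."""
    state, actions = start, []
    for _ in range(max_steps):
        action, next_state = step_fn(state, goal, actions)
        if action == END:
            break
        # Assumed revision rule: fall back to the nearest stored action
        # when the proposal is not among the retrieved neighbours.
        neighbours = retrieve(memory, state, k=1)
        if neighbours and action not in neighbours:
            action = neighbours[0]
        actions.append(action)
        state = next_state
    return actions

# Toy data (hypothetical): a memory of state-action pairs and fixed transitions.
MEMORY = [((1.0, 0.0), "add coffee"), ((1.0, 1.0), "pour water")]
TRANSITIONS = {(1.0, 0.0): (1.0, 1.0), (1.0, 1.0): (0.0, 1.0)}

def toy_step(state, goal, history):
    """Stand-in for a learned model: proposes a placeholder action or END."""
    if state == goal:
        return END, state
    return "unknown", TRANSITIONS[state]

print(plan((1.0, 0.0), (0.0, 1.0), toy_step, MEMORY))
print(plan((1.0, 1.0), (0.0, 1.0), toy_step, MEMORY))
```

The two calls start from different states and produce plans of different lengths without the length being passed in, which is the point of the adaptive setting. In a real system the step function would be a learned model and the memory would hold features extracted from training videos.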

