This paper proposes a method for long-term action anticipation (LTA), the task of predicting future action labels and their durations in a video given an initial untrimmed observation interval. We build on an encoder-decoder architecture with parallel decoding and make two key contributions. First, we introduce a bi-directional action context regularizer module on top of the decoder that enforces temporal context coherence between temporally adjacent segments. Second, we learn from classified segments a transition matrix that models the probability of transitioning from one action to another, and we optimize the predicted sequence globally over the full prediction interval. In addition, we use a specialized encoder for the task of action segmentation to improve the quality of the predictions in the observation interval at inference time, leading to a better understanding of the past. We validate our method on four LTA benchmark datasets, EpicKitchen-55, EGTEA+, 50Salads, and Breakfast, demonstrating performance superior or comparable to state-of-the-art methods, including probabilistic models and methods based on Large Language Models that assume trimmed video as input. The code will be released upon acceptance.
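The transition-matrix idea can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: it assumes transition probabilities are estimated by counting consecutive segment labels, and that the global optimization over the prediction interval takes the form of a Viterbi-style dynamic program combining per-segment classifier scores with the learned transitions. All function names and shapes below are hypothetical.

```python
import numpy as np

def learn_transition_matrix(segment_sequences, num_actions, smoothing=1.0):
    """Count transitions between consecutive segment labels and row-normalize.

    Hypothetical estimator: Laplace-smoothed bigram counts over action labels.
    """
    counts = np.full((num_actions, num_actions), smoothing)
    for seq in segment_sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

def decode_global(log_probs, transition):
    """Viterbi decoding over the full prediction interval.

    log_probs: (T, A) per-segment classifier log-scores for A actions.
    transition: (A, A) learned transition matrix.
    Returns the globally most likely action sequence of length T.
    """
    T, A = log_probs.shape
    log_trans = np.log(transition)
    dp = np.zeros((T, A))          # best cumulative score ending in each action
    back = np.zeros((T, A), dtype=int)  # backpointers for path recovery
    dp[0] = log_probs[0]
    for t in range(1, T):
        scores = dp[t - 1][:, None] + log_trans  # (prev_action, next_action)
        back[t] = scores.argmax(axis=0)
        dp[t] = scores.max(axis=0) + log_probs[t]
    # Backtrack from the best final state.
    path = [int(dp[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy usage: two actions, transitions learned from two classified sequences.
M = learn_transition_matrix([[0, 0, 1, 1], [0, 1, 1, 1]], num_actions=2)
seq = decode_global(np.log(np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]])), M)
```

The global decode can flip locally ambiguous segment predictions toward sequences that are consistent with typical action orderings, which is the intuition behind optimizing over the full prediction interval rather than per segment.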