Early action recognition is an important and challenging problem: the goal is to recognize an action from a partially observed video stream in which the activity may be unfinished or may not even have started. In this work, we propose a novel model that learns a prototypical representation of the full action for each class and uses it to regularize the architecture and the visual representations of the partial observations. The model is simple in design and efficient. We decompose the video into short clips, and a visual encoder extracts features from each clip independently. A decoder then aggregates the features from all clips in an online fashion to produce the final class prediction. During training, for each partial observation, the model is trained jointly to predict both the label and the prototypical representation of the action, which acts as a regularizer. We evaluate our method on multiple challenging real-world datasets and outperform the current state of the art by a significant margin. For example, on early recognition observing only the first 10% of each video, our method improves on the SOTA by +2.23 in Top-1 accuracy on Something-Something-v2, +3.55 on UCF-101, +3.68 on SSsub21, and +5.03 on EPIC-Kitchens-55, where prior work used either multi-modal inputs (e.g. optical flow) or batched inference. Finally, we present exhaustive ablation studies to motivate our design choices and to gain insight into what the model learns semantically.
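The pipeline described above (clip decomposition, independent per-clip encoding, online aggregation, and a joint label-plus-prototype objective) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the method itself: the encoder is a mean over frame vectors, the decoder is a running mean, the prototype term is a mean squared distance, and the prototypes are passed in as fixed inputs rather than learned. The names `split_into_clips`, `encode_clip`, `OnlineDecoder`, `joint_loss`, and the weight `lam` are hypothetical.

```python
import math


def split_into_clips(frames, clip_len):
    """Decompose a video (a list of frame vectors) into consecutive short clips."""
    return [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]


def encode_clip(clip):
    """Stand-in visual encoder (assumption): mean of the frame vectors of one clip.
    Each clip is encoded independently of all other clips."""
    dim = len(clip[0])
    return [sum(frame[d] for frame in clip) / len(clip) for d in range(dim)]


class OnlineDecoder:
    """Stand-in decoder (assumption): running mean of the clip features seen so far.
    It only uses past clips, so it can run on a live stream."""

    def __init__(self, dim):
        self.state = [0.0] * dim
        self.count = 0

    def update(self, feat):
        self.count += 1
        self.state = [s + (x - s) / self.count for s, x in zip(self.state, feat)]
        return list(self.state)


def softmax_cross_entropy(logits, label):
    """Numerically stable cross-entropy for a single example."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]


def joint_loss(rep, label, class_weights, prototypes, lam=1.0):
    """Classification loss plus a prototype term acting as a regularizer.
    The squared-distance form and the weight `lam` are assumptions."""
    logits = [sum(w * r for w, r in zip(row, rep)) for row in class_weights]
    ce = softmax_cross_entropy(logits, label)
    proto = sum((r - p) ** 2 for r, p in zip(rep, prototypes[label])) / len(rep)
    return ce + lam * proto


def partial_observation_losses(frames, label, class_weights, prototypes,
                               clip_len=2, lam=1.0):
    """Return one joint loss per partial observation (after each new clip)."""
    decoder = OnlineDecoder(len(frames[0]))
    losses = []
    for clip in split_into_clips(frames, clip_len):
        rep = decoder.update(encode_clip(clip))
        losses.append(joint_loss(rep, label, class_weights, prototypes, lam))
    return losses
```

In this sketch every prefix of the video contributes its own loss term, which mirrors the idea of supervising each partial observation with both the label and the full-action prototype of its class.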