Temporal action segmentation aims to densely label the frames of minutes-long videos with multiple action classes. The task is categorized as long-range video understanding, and researchers have proposed an extensive collection of methods and evaluated them on various benchmarks. Despite the rapid development of action segmentation techniques in recent years, no systematic survey of the field exists. To fill this gap, we analyse and summarize the main contributions and trends for this task. Specifically, we first examine the task definition, common benchmarks, types of supervision, and popular evaluation measures. We then systematically investigate two fundamental aspects of the topic that are extensively studied in the literature: frame representation and temporal modeling. Next, we comprehensively review existing temporal action segmentation works, categorized by their form of supervision. Finally, we conclude by identifying several open topics for research. We also supplement the survey with a curated list of temporal action segmentation resources, available at https://github.com/atlas-eccv22/awesome-temporal-action-segmentation.