Temporal action segmentation (TAS) aims to densely label the frames of minutes-long videos, each containing multiple action classes. For this long-range video understanding task, researchers have developed an extensive collection of methods and examined their performance on various benchmarks. Despite the rapid growth of TAS techniques in recent years, no systematic survey of the area has been conducted. In this survey, we analyze and summarize the most significant contributions and trends in the field. In particular, we first examine the task definition, common benchmarks, types of supervision, and prevalent evaluation measures. We then systematically investigate two essential techniques, frame representation and temporal modeling, both of which have been studied extensively in the literature. Next, we thoroughly review existing TAS works, categorized by their level of supervision, and conclude by identifying and emphasizing several research gaps. We have also curated a list of TAS resources, available at https://github.com/atlas-eccv22/awesome-temporal-action-segmentation.