Temporal action localization (TAL) is a widely studied task owing to its broad application potential. Existing works in this field mainly suffer from two weaknesses: (1) they often neglect the multi-label case and focus only on temporal modeling; (2) they ignore the semantic information in class labels and use only visual information. To address these problems, we propose a novel Co-Occurrence Relation Module (CORM) that explicitly models the co-occurrence relationships between actions. Besides visual information, it also uses the semantic embeddings of class labels to model these relationships. CORM works in a plug-and-play manner and can be easily incorporated into existing sequence models. By considering both visual and semantic co-occurrence, our method achieves a strong capacity for modeling multi-label relationships. Moreover, existing TAL datasets typically focus on low-semantic atomic actions. We therefore construct a challenging multi-label dataset, UCF-Crime-TAL, that focuses on high-semantic actions, by annotating the UCF-Crime dataset at the frame level and accounting for the semantic overlap between different events. Extensive experiments on two commonly used TAL datasets, \textit{i.e.}, MultiTHUMOS and TSU, and on our newly proposed UCF-Crime-TAL demonstrate the effectiveness of the proposed CORM, which achieves state-of-the-art performance on these datasets.