Temporal action detection (TAD) is challenging yet fundamental for real-world video applications. Recently, DETR-based models for TAD have been prevailing thanks to their unique benefits. However, transformers demand large-scale data, and the data scarcity in TAD causes severe performance degradation. In this paper, we identify two crucial problems arising from data scarcity: attention collapse and imbalanced performance. To address them, we propose a new pre-training strategy tailored for transformers, Long-Term Pre-training (LTP), with two main components: 1) class-wise synthesis and 2) long-term pretext tasks. First, we synthesize long-form video features by merging video snippets of a target class with snippets of non-target classes; although created from trimmed data, they are analogous to the untrimmed data used in TAD. In addition, we devise two types of long-term pretext tasks to learn long-term dependency; they impose long-term conditions, such as finding the second-to-fourth occurring actions or short-duration actions. Our extensive experiments show state-of-the-art performance among DETR-based methods on ActivityNet-v1.3 and THUMOS14 by a large margin. Moreover, we demonstrate that LTP significantly relieves the data-scarcity issues in TAD.
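To make the class-wise synthesis idea concrete, below is a minimal sketch, not the authors' exact procedure: trimmed snippet features of one target class are interleaved with non-target (distractor) snippets to form an untrimmed-like long-form sequence, and the resulting target-segment boundaries serve as pseudo ground truth for the long-term pretext tasks. All names (`synthesize_long_video`, `num_target`) and the (num_snippets, feat_dim) feature layout are illustrative assumptions.

```python
# A minimal sketch of class-wise synthesis as described in the abstract.
# Assumes each trimmed clip is already encoded as a (T, D) feature array.
import random
import numpy as np

def synthesize_long_video(target_feats, distractor_feats, num_target=3):
    """Merge target-class and non-target snippets into one long sequence.

    target_feats:     list of (T_i, D) arrays, all from one target class
    distractor_feats: list of (T_j, D) arrays from other (non-target) classes
    Returns the concatenated features and the (start, end) snippet indices
    of each target segment, usable as labels for pretext tasks such as
    "find the second-to-fourth actions" or "find short-duration actions".
    """
    chosen_targets = random.sample(target_feats, k=num_target)
    chosen_distractors = random.sample(distractor_feats, k=num_target + 1)

    pieces, segments, cursor = [], [], 0
    for i, tgt in enumerate(chosen_targets):
        # Interleave a non-target snippet before each target instance
        # so the target actions are embedded in background content.
        dis = chosen_distractors[i]
        pieces.append(dis)
        cursor += len(dis)
        pieces.append(tgt)
        segments.append((cursor, cursor + len(tgt)))
        cursor += len(tgt)
    pieces.append(chosen_distractors[-1])  # trailing background

    return np.concatenate(pieces, axis=0), segments
```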