This paper tackles point-supervised temporal action detection, in which only a single frame is annotated for each action instance in the training set. Because the annotated points are sparse, most current methods struggle to represent the continuous structure of actions or the temporal and semantic dependencies within action instances. Consequently, these methods often learn only the most discriminative segments of actions and produce incomplete action proposals. This paper proposes POTLoc, a Pseudo-label Oriented Transformer for weakly-supervised Action Localization that uses only point-level annotation. POTLoc is designed to identify and track continuous action structures via a self-training strategy. The base model first generates action proposals using only point-level supervision. These proposals are then refined and regressed to sharpen the estimated action boundaries, yielding "pseudo-labels" that serve as supplementary supervisory signals. The model architecture integrates a transformer with a temporal feature pyramid to capture dependencies between video snippets and to model actions of varying duration. The pseudo-labels, which provide the coarse locations and boundaries of actions, guide the transformer toward better learning of action dynamics. POTLoc outperforms state-of-the-art point-supervised methods on the THUMOS'14 and ActivityNet-v1.2 datasets, improving average mAP by 5% on THUMOS'14.
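
The abstract does not spell out how point annotations are turned into pseudo-labels, so the following is only a minimal sketch of the general idea, not POTLoc's actual refinement and regression procedure. It assumes that a base model has already produced one action score per video snippet, and it grows a segment outward from each annotated point while neighbouring scores stay above a threshold. The function names, the `threshold` parameter, and the region-growing rule are all hypothetical choices made for illustration.

```python
from typing import List, Tuple


def expand_point_to_segment(
    scores: List[float], point: int, threshold: float = 0.5
) -> Tuple[int, int]:
    """Grow a segment around an annotated snippet index.

    Assumption (not from the paper): `scores` holds one action score per
    snippet, and a boundary is placed where the score drops below `threshold`.
    Returns inclusive (start, end) snippet indices.
    """
    if not 0 <= point < len(scores):
        raise IndexError("point must be a valid snippet index")

    start = end = point
    # Extend to the left while the previous snippet still looks like action.
    while start > 0 and scores[start - 1] >= threshold:
        start -= 1
    # Extend to the right while the next snippet still looks like action.
    while end < len(scores) - 1 and scores[end + 1] >= threshold:
        end += 1
    return start, end


def make_pseudo_labels(
    scores: List[float], points: List[int], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Produce one pseudo-label segment per annotated point (hypothetical helper)."""
    return [expand_point_to_segment(scores, p, threshold) for p in points]


if __name__ == "__main__":
    # Toy per-snippet scores with two annotated points (snippets 2 and 7).
    toy_scores = [0.1, 0.6, 0.9, 0.7, 0.2, 0.1, 0.8, 0.9, 0.3]
    print(make_pseudo_labels(toy_scores, points=[2, 7]))
```

In a self-training loop of the kind the abstract describes, segments like these would be fed back as additional supervision for the next training round. The simple thresholding here stands in for whatever boundary estimation the model actually uses.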