Temporal action detection (TAD) aims to detect every action instance in an untrimmed video, predicting both its temporal boundaries and its category. Because action boundaries in video are often unclear, existing methods tend to localize them imprecisely. To address this, we propose a one-stage framework named TriDet. First, we propose a Trident-head that models each action boundary through an estimated relative probability distribution around the boundary. Second, we analyze the rank-loss problem (i.e., the deterioration of instant discriminability) in transformer-based methods and propose an efficient scalable-granularity perception (SGP) layer to mitigate it. Third, to further push the limit of instant discriminability in the video backbone, we leverage the strong representation capability of pretrained large models and investigate their performance on TAD. Finally, because classification requires adequate spatial-temporal context, we design a decoupled feature pyramid network with separate feature pyramids, which incorporates rich spatial context from the large model for localization. Experimental results demonstrate the robustness of TriDet and its state-of-the-art performance on multiple TAD datasets, including hierarchical (multilabel) TAD datasets.
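To make the idea of modeling a boundary through a relative probability distribution concrete, the following is a minimal sketch and not the TriDet implementation. It assumes, as an illustration only, that the distribution is a softmax over a small number of discrete temporal bins next to a reference instant, and that the boundary offset is read out as the expectation of that distribution. The function names `softmax` and `expected_offset`, the bin count, and the logit values are all hypothetical; the abstract does not specify any of them.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_offset(bin_logits):
    """Expected boundary offset, in bins, from a reference instant.

    Assumption for illustration: bin i corresponds to an offset of i
    temporal steps, and the offset is the mean of the softmax
    distribution over the bins.
    """
    probs = softmax(bin_logits)
    return sum(i * p for i, p in enumerate(probs))


# Hypothetical logits over 5 bins (offsets 0..4).
sharp = [0.0, 0.0, 8.0, 0.0, 0.0]      # confident, well-defined boundary
ambiguous = [0.0, 2.0, 2.0, 2.0, 0.0]  # unclear boundary, mass spread out
late = [0.0, 0.0, 0.0, 0.0, 10.0]      # boundary near the farthest bin

print(expected_offset(sharp))
print(expected_offset(ambiguous))
print(expected_offset(late))
```

In this sketch the sharp and the ambiguous distributions are both symmetric about the middle bin, so they yield the same expected offset even though their spreads differ. This illustrates why a distribution-based readout can still produce a usable estimate when no single instant stands out as the boundary.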