In this paper, we present TriDet, a one-stage framework for temporal action detection. Existing methods often predict imprecise boundaries because action boundaries in videos are ambiguous. To alleviate this problem, we propose a novel Trident-head that models each action boundary via an estimated relative probability distribution around the boundary. In the feature pyramid of TriDet, we propose an efficient Scalable-Granularity Perception (SGP) layer that mitigates the rank loss problem of self-attention on video features and aggregates information across different temporal granularities. Benefiting from the Trident-head and the SGP-based feature pyramid, TriDet achieves state-of-the-art performance on three challenging benchmarks, THUMOS14, HACS and EPIC-KITCHENS-100, at lower computational cost than previous methods. For example, TriDet achieves an average mAP of $69.3\%$ on THUMOS14, outperforming the previous best by $2.5\%$ with only $74.6\%$ of its latency. The code is available at https://github.com/sssste/TriDet.
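The abstract does not spell out how a relative probability distribution is turned into a boundary location. As a minimal sketch of one plausible reading, the snippet below treats the boundary offset as the expected value of a softmax distribution over discrete temporal bins. The bin count, the logit values, and the function names `softmax` and `expected_offset` are illustrative assumptions, not taken from the TriDet code.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_offset(bin_logits):
    """Expected distance, in bins, from the current instant to the boundary.

    Assumption: bin b means "the boundary lies b steps away", so the
    estimate is the probability-weighted mean of the bin indices.
    """
    probs = softmax(bin_logits)
    return sum(b * p for b, p in enumerate(probs))


# Hypothetical boundary scores for 5 bins (0 to 4 steps away).
logits = [0.1, 0.3, 2.0, 1.8, 0.2]
offset = expected_offset(logits)
print(f"estimated boundary offset: {offset:.2f} bins")
```

Because the estimate is an expectation rather than an arg-max, two neighbouring bins with similar scores yield an offset between them, which is one way an ambiguous boundary can still be localized with sub-bin resolution.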