In this paper, we present TriDet, a one-stage framework for temporal action detection. Existing methods often predict imprecise boundaries because action boundaries in videos are ambiguous. To alleviate this problem, we propose a novel Trident-head that models each action boundary through an estimated relative probability distribution around the boundary. In the feature pyramid of TriDet, we propose an efficient Scalable-Granularity Perception (SGP) layer that mitigates the rank-loss problem self-attention exhibits on video features and aggregates information across different temporal granularities. Benefiting from the Trident-head and the SGP-based feature pyramid, TriDet achieves state-of-the-art performance on three challenging benchmarks, THUMOS14, HACS and EPIC-KITCHENS-100, at lower computational cost than previous methods. For example, TriDet achieves an average mAP of $69.3\%$ on THUMOS14, outperforming the previous best by $2.5\%$ while requiring only $74.6\%$ of its latency. The code is available at https://github.com/sssste/TriDet.
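To make the idea of modeling a boundary through a relative probability distribution concrete, the following is a minimal sketch and not the released implementation. It assumes that, for a given time step, a head produces one logit per discrete offset bin for the start boundary and one set for the end boundary, and that the boundary distance is decoded as the expected bin index under the softmax of those logits. The function names (`expected_offset`, `decode_segment`), the bin layout, and the logit values are hypothetical and chosen for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_offset(logits):
    """Expected offset (in bins) under the distribution softmax(logits).

    Assumption: bin i corresponds to a boundary lying i steps away
    from the current time step.
    """
    probs = softmax(logits)
    return sum(i * p for i, p in enumerate(probs))


def decode_segment(t, start_logits, end_logits):
    """Turn per-bin logits at time step t into a (start, end) pair.

    Assumption: the start lies before t and the end after t, each at
    the expected offset of its own distribution.
    """
    d_start = expected_offset(start_logits)
    d_end = expected_offset(end_logits)
    return t - d_start, t + d_end


if __name__ == "__main__":
    # Hypothetical logits: the start is most likely 2 bins back,
    # the end most likely 1 bin ahead.
    start_logits = [0.1, 0.5, 2.0, 0.3]
    end_logits = [0.2, 1.8, 0.4, 0.1]
    print(decode_segment(10, start_logits, end_logits))
```

Decoding the boundary as an expectation over bins, rather than as a single regressed value, lets an ambiguous boundary spread its probability mass over several neighboring positions while still yielding a continuous estimate.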