Temporal Action Detection (TAD) is fundamental yet challenging for real-world video applications. Leveraging the unique benefits of transformers, various DETR-based approaches have been adopted for TAD. However, recent work has identified that attention collapse in self-attention degrades the performance of DETR for TAD. Building upon this research, this paper newly addresses the attention collapse problem in cross-attention within DETR-based TAD methods. Moreover, our findings reveal that cross-attention exhibits patterns distinct from the predictions, indicating a short-cut phenomenon. To resolve this, we propose a new framework, Prediction-Feedback DETR (Pred-DETR), which utilizes the predictions to restore the collapsed attention and to align the cross- and self-attention with the predictions. Specifically, we devise novel prediction-feedback objectives guided by the relations among the predictions. As a result, Pred-DETR significantly alleviates the collapse and achieves state-of-the-art performance among DETR-based methods on various challenging benchmarks, including THUMOS14, ActivityNet-v1.3, HACS, and FineAction.
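The abstract does not define the prediction-feedback objectives concretely, so the following is only a hypothetical minimal sketch of the general idea: using pairwise relations among the predicted action segments (here, 1D temporal IoU) as guidance for a decoder attention map, so that attention is pushed away from collapsed or short-cut patterns and toward the structure the predictions themselves exhibit. The function names, the choice of IoU as the relation, and the KL-divergence alignment loss are all assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def segment_iou(segs: torch.Tensor) -> torch.Tensor:
    """Pairwise 1D temporal IoU between predicted segments.

    segs: (Q, 2) tensor of (start, end) times for Q decoder queries.
    Returns a (Q, Q) relation matrix in [0, 1].
    """
    s1, e1 = segs[:, None, 0], segs[:, None, 1]
    s2, e2 = segs[None, :, 0], segs[None, :, 1]
    inter = (torch.minimum(e1, e2) - torch.maximum(s1, s2)).clamp(min=0)
    union = (e1 - s1) + (e2 - s2) - inter
    return inter / union.clamp(min=1e-6)

def prediction_feedback_loss(attn: torch.Tensor, pred_segs: torch.Tensor) -> torch.Tensor:
    """Hypothetical alignment loss between an attention map and prediction relations.

    attn: (Q, Q) decoder self-attention map (each row sums to 1).
    pred_segs: (Q, 2) predicted segments from the same decoder layer.
    The row-normalised IoU matrix serves as the feedback target distribution.
    """
    target = F.normalize(segment_iou(pred_segs), p=1, dim=-1)
    return F.kl_div(attn.clamp(min=1e-9).log(), target, reduction="batchmean")
```

In a training loop, a term like this would be added to the usual DETR detection losses, with the gradient through `pred_segs` optionally detached so the predictions guide the attention rather than the reverse.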