Temporal Action Detection (TAD) is a challenging but fundamental task for real-world video applications. DETR-based models have recently been proposed for TAD, but they have not yet performed well. In this paper, we identify a problem in the self-attention of DETR for TAD: the attention modules focus on only a few key elements, which we call the temporal collapse problem. This degrades the capability of the encoder and decoder, since their self-attention modules effectively play no role. To solve the problem, we propose a novel framework, Self-DETR, which utilizes the cross-attention maps of the decoder to reactivate the self-attention modules. We recover the relationships among encoder features by a simple matrix multiplication of the cross-attention map and its transpose. Likewise, we obtain the relationships among decoder queries. By guiding the collapsed self-attention maps with the resulting guidance maps, we resolve the temporal collapse of the self-attention modules in both the encoder and decoder. Our extensive experiments demonstrate that Self-DETR resolves the temporal collapse problem by maintaining high diversity of attention across all layers.
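The guidance-map construction described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the row normalization, and the KL-divergence guidance loss are assumptions made for illustration only. The abstract states just that the cross-attention map is multiplied with its transpose and that the result guides the self-attention maps.

```python
import numpy as np

def row_normalize(m, eps=1e-8):
    # Assumption: rescale each row to sum to 1 so it is comparable to an attention map.
    return m / (m.sum(axis=-1, keepdims=True) + eps)

def guidance_maps(cross_attn):
    """cross_attn has shape (num_queries, num_features); each row sums to 1."""
    # Feature-to-feature relations: A^T A, shape (num_features, num_features).
    enc_guide = row_normalize(cross_attn.T @ cross_attn)
    # Query-to-query relations: A A^T, shape (num_queries, num_queries).
    dec_guide = row_normalize(cross_attn @ cross_attn.T)
    return enc_guide, dec_guide

def guidance_loss(self_attn, guide, eps=1e-8):
    # Hypothetical loss: KL(guide || self_attn), averaged over rows.
    kl = guide * (np.log(guide + eps) - np.log(self_attn + eps))
    return float(np.mean(np.sum(kl, axis=-1)))

# Toy cross-attention map: 4 decoder queries attending over 10 encoder features.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))
cross_attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
enc_guide, dec_guide = guidance_maps(cross_attn)

# A collapsed self-attention map: every row puts almost all weight on feature 0.
collapsed = np.full((10, 10), 0.001)
collapsed[:, 0] = 1.0 - 0.001 * 9
loss_collapsed = guidance_loss(collapsed, enc_guide)
loss_matched = guidance_loss(enc_guide, enc_guide)
```

In this sketch the loss is zero when the self-attention map already matches the guidance map and positive for the collapsed map, so minimizing it would push collapsed attention toward the relations implied by the cross-attention.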