We propose a new formulation of temporal action detection (TAD) based on denoising diffusion, DiffTAD for short. Taking random temporal proposals as input, it accurately yields action proposals for an untrimmed long video. This offers a generative modeling perspective, in contrast to previous discriminative learning approaches. This capability is achieved by first diffusing the ground-truth proposals into random ones (i.e., the forward/noising process) and then learning to reverse the noising process (i.e., the backward/denoising process). Concretely, we establish the denoising process in a Transformer decoder (e.g., DETR) by introducing a temporal location query design that converges faster in training. We further propose a cross-step selective conditioning algorithm to accelerate inference. Extensive evaluations on ActivityNet and THUMOS show that DiffTAD achieves top performance compared to prior art. The code will be made available at https://github.com/sauradip/DiffusionTAD.
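To make the forward/noising process concrete, the following is a minimal sketch of how ground-truth proposals could be diffused toward random ones. It is not taken from the DiffTAD code: it assumes proposals are (start, end) pairs normalized to [0, 1], uses the standard DDPM closed form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, and picks a cosine noise schedule. The names `cosine_alpha_bar` and `noise_proposals` are hypothetical.

```python
import math
import random


def cosine_alpha_bar(t, num_steps, s=0.008):
    """Cumulative signal fraction alpha_bar_t under a cosine schedule.

    The schedule choice is an assumption made for illustration.
    """
    def f(u):
        return math.cos((u / num_steps + s) / (1 + s) * math.pi / 2) ** 2

    return f(t) / f(0)


def noise_proposals(proposals, t, num_steps=1000, rng=random):
    """Sample x_t ~ q(x_t | x_0) for a list of (start, end) proposals.

    Assumes coordinates are normalized to [0, 1]. At t = 0 the proposals are
    returned unchanged; at t = num_steps they are almost pure Gaussian noise.
    """
    a_bar = cosine_alpha_bar(t, num_steps)
    signal = math.sqrt(a_bar)
    noise = math.sqrt(1.0 - a_bar)
    return [
        tuple(signal * x + noise * rng.gauss(0.0, 1.0) for x in (start, end))
        for start, end in proposals
    ]


if __name__ == "__main__":
    ground_truth = [(0.10, 0.35), (0.60, 0.80)]  # made-up example segments
    for step in (0, 250, 1000):
        print(step, noise_proposals(ground_truth, step))
```

A denoising model would be trained to recover the ground-truth proposals from such noisy samples, conditioned on video features. At inference it would start from purely random proposals and apply the learned reverse process over a number of steps.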