We propose a new formulation of temporal action detection (TAD) based on denoising diffusion, DiffTAD for short. Taking random temporal proposals as input, it accurately yields action proposals for an untrimmed long video. This offers a generative modeling perspective, in contrast to previous discriminative learning approaches. The capability is achieved by first diffusing the ground-truth proposals into random ones (i.e., the forward/noising process) and then learning to reverse the noising process (i.e., the backward/denoising process). Concretely, we establish the denoising process in a Transformer decoder (e.g., DETR) by introducing a temporal location query design that converges faster in training. We further propose a cross-step selective conditioning algorithm to accelerate inference. Extensive evaluations on ActivityNet and THUMOS show that DiffTAD achieves top performance compared to prior art alternatives. The code will be made available at https://github.com/sauradip/DiffusionTAD.
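The forward/noising process described in the abstract can be illustrated with a minimal sketch. It is not the released DiffTAD code. It assumes a standard DDPM-style Gaussian corruption with a cosine noise schedule, proposals stored as (center, width) pairs normalized to [0, 1], and a signal scale of 2.0; the abstract specifies none of these choices. The names `cosine_alpha_bar` and `diffuse_proposals` are hypothetical.

```python
import math
import random


def cosine_alpha_bar(t, num_steps, s=0.008):
    """Cumulative signal level at step t under a cosine schedule (assumed)."""
    def f(u):
        return math.cos((u / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def diffuse_proposals(proposals, t, num_steps, rng, scale=2.0):
    """Corrupt ground-truth proposals to noise level t (forward process).

    proposals: list of (center, width) pairs normalized to [0, 1] (assumed layout).
    Returns noisy (center, width) pairs, also in [0, 1].
    """
    a_bar = cosine_alpha_bar(t, num_steps)
    noisy = []
    for pair in proposals:
        out = []
        for v in pair:
            x0 = (2.0 * v - 1.0) * scale          # map [0, 1] to [-scale, scale]
            eps = rng.gauss(0.0, 1.0)             # standard Gaussian noise
            xt = math.sqrt(a_bar) * x0 + math.sqrt(1.0 - a_bar) * eps
            xt = max(-scale, min(scale, xt))      # clamp to the signal range
            out.append((xt / scale + 1.0) / 2.0)  # map back to [0, 1]
        noisy.append(tuple(out))
    return noisy


rng = random.Random(0)
gt = [(0.25, 0.10), (0.70, 0.20)]  # two hypothetical ground-truth proposals
unchanged = diffuse_proposals(gt, t=0, num_steps=1000, rng=rng)
nearly_random = diffuse_proposals(gt, t=999, num_steps=1000, rng=rng)
```

At `t=0` the proposals are returned unchanged up to floating-point error, while at the last step they are dominated by noise. In the setup the abstract describes, a decoder would be trained to map such noisy proposals back to the ground truth, so that inference can start from purely random proposals.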