Temporal action segmentation is crucial for understanding long-form videos. Prior work on this task commonly adopts an iterative refinement paradigm built on multi-stage models. We propose a fundamentally different framework based on denoising diffusion models, which nonetheless shares the spirit of such iterative refinement. In this framework, action predictions are progressively generated from random noise, conditioned on the input video features. To better model three salient characteristics of human actions, namely the position prior, the boundary ambiguity, and the relational dependency, we devise a unified masking strategy for the conditioning inputs. Extensive experiments on three benchmark datasets, i.e., GTEA, 50Salads, and Breakfast, show that the proposed method achieves results superior or comparable to state-of-the-art methods, demonstrating the effectiveness of a generative approach to action segmentation. Our code will be made available.
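To make the generation process concrete, the following is a minimal sketch of conditional iterative denoising over frame-wise action scores. It is an illustration under stated assumptions, not the paper's implementation: `toy_denoiser`, the weight matrix `W`, the linear noise schedule, the deterministic DDIM-style update, and the frame-wise `cond_mask` are all hypothetical stand-ins chosen for brevity.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def toy_denoiser(x_t, feats, W):
    # Hypothetical stand-in for a learned network: predicts clean
    # per-frame class probabilities from the noisy sequence x_t and
    # the conditioning video features.
    return softmax(feats @ W + 0.1 * x_t)

def sample_actions(feats, W, num_classes, steps=8, cond_mask=None, seed=0):
    """Generate frame-wise action scores from noise, conditioned on feats.

    feats:     (T, D) video features (the conditioning input).
    cond_mask: optional (T,) array of 0/1; zeros hide the features of
               those frames (an assumed, simplified form of masking).
    """
    rng = np.random.default_rng(seed)
    T = feats.shape[0]
    if cond_mask is not None:
        feats = feats * cond_mask[:, None]
    # Assumed schedule: cumulative signal level rises from ~0 (noise) to 1 (clean).
    alpha_bar = np.linspace(1e-3, 1.0, steps + 1)
    x = rng.standard_normal((T, num_classes))  # start from random noise
    for i in range(steps):
        a_t, a_prev = alpha_bar[i], alpha_bar[i + 1]
        x0 = toy_denoiser(x, feats, W)                        # predicted clean sequence
        eps = (x - np.sqrt(a_t) * x0) / np.sqrt(1.0 - a_t)    # implied noise
        x = np.sqrt(a_prev) * x0 + np.sqrt(1.0 - a_prev) * eps  # move to the less noisy level
    return x  # (T, num_classes); argmax over classes gives frame labels
```

Each pass refines the whole prediction sequence at once, which mirrors the iterative refinement of multi-stage models, while masking the conditioning features forces the prediction to rely on information other than the hidden frames.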