Anticipating future actions is inherently uncertain. Given an observed video segment containing ongoing actions, multiple subsequent actions can plausibly follow, and this uncertainty grows as the prediction horizon extends further into the future. However, most existing action anticipation models are deterministic and do not account for this uncertainty. In this work, we rethink action anticipation from a generative view, employing diffusion models to capture the different possible future actions. In this framework, future actions are iteratively generated from standard Gaussian noise in the latent space, conditioned on the observed video, and are subsequently mapped into the action space. Extensive experiments on four benchmark datasets (Breakfast, 50Salads, EpicKitchens, and EGTEA Gaze+) show that the proposed method achieves results superior or comparable to state-of-the-art methods, demonstrating the effectiveness of a generative approach for action anticipation. Our code and trained models will be published on GitHub.
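The generation procedure summarized above (start from standard Gaussian noise in a latent space, denoise iteratively while conditioning on the observed video, then map the latents to action labels) can be illustrated with a minimal sketch. It assumes a DDPM-style ancestral sampler with a linear noise schedule, which the abstract does not specify. Every name here (`make_toy_model`, `toy_denoiser`, `sample_future_actions`), the dimensions, and the randomly initialized weights are hypothetical placeholders for a trained network, not the released implementation.

```python
import numpy as np


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumption; other schedules are possible)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def make_toy_model(latent_dim, cond_dim, num_classes, seed=0):
    """Hypothetical stand-in for trained weights: random linear maps."""
    rng = np.random.default_rng(seed)
    return {
        # Noise predictor input: latent, video condition, and a time scalar.
        "w_eps": rng.normal(scale=0.1, size=(latent_dim + cond_dim + 1, latent_dim)),
        # Decoder from latent space to action-class logits.
        "w_dec": rng.normal(scale=0.1, size=(latent_dim, num_classes)),
    }


def toy_denoiser(z_t, t_frac, cond, w_eps):
    """Predict the noise in z_t given the timestep and the observed-video feature."""
    n = z_t.shape[0]
    cond_rep = np.broadcast_to(cond, (n, cond.shape[-1]))
    t_col = np.full((n, 1), t_frac)
    h = np.concatenate([z_t, cond_rep, t_col], axis=-1)
    return np.tanh(h @ w_eps)


def sample_future_actions(model, cond, num_future, num_steps=50, seed=0):
    """Draw one plausible sequence of future action labels."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    latent_dim = model["w_dec"].shape[0]

    # Start from standard Gaussian noise in the latent space.
    z = rng.standard_normal((num_future, latent_dim))

    # Reverse diffusion: iteratively denoise, conditioned on the observation.
    for t in range(num_steps - 1, -1, -1):
        eps_hat = toy_denoiser(z, t / num_steps, cond, model["w_eps"])
        mean = (z - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # No noise is added at the final step.
        noise = rng.standard_normal(z.shape) if t > 0 else np.zeros_like(z)
        z = mean + np.sqrt(betas[t]) * noise

    # Map the denoised latents into the action space (class logits -> labels).
    logits = z @ model["w_dec"]
    return logits.argmax(axis=-1)


if __name__ == "__main__":
    model = make_toy_model(latent_dim=16, cond_dim=8, num_classes=10)
    observed = np.ones(8)  # placeholder feature for the observed video segment
    for seed in range(3):
        print(sample_future_actions(model, observed, num_future=4, seed=seed))
```

Calling the sampler with different seeds but the same observed feature draws different noise trajectories, which is how a generative formulation can represent several plausible futures for one observation. With the untrained placeholder weights used here, the sampled labels carry no meaning.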