Predicting 3D human poses in real-world scenarios, also known as human pose forecasting, is inevitably subject to noisy inputs arising from inaccurate 3D pose estimates and occlusions. To address these challenges, we propose a diffusion-based approach that can forecast poses from noisy observations. We frame the prediction task as a denoising problem in which the observation and the prediction are treated as a single sequence containing missing elements, whether in the observation or in the prediction horizon. All missing elements are treated as noise and denoised with our conditional diffusion model. To better handle long forecasting horizons, we present a temporal cascaded diffusion model. We demonstrate the benefits of our approach on four publicly available datasets (Human3.6M, HumanEva-I, AMASS, and 3DPW), where it outperforms the state of the art. Additionally, we show that our framework is generic enough to improve any 3D pose prediction model, serving as a pre-processing step to repair its inputs and as a post-processing step to refine its outputs. The code is available online: \url{https://github.com/vita-epfl/DePOSit}.
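The framing above, in which observed and future frames form one sequence whose missing elements start as noise, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `build_masked_sequence`, the array shapes, and the choice of standard Gaussian noise for initialization are assumptions made for illustration, and the conditional diffusion model that would denoise the sequence is not shown.

```python
import numpy as np


def build_masked_sequence(observed, obs_mask, horizon, rng):
    """Join observation and prediction horizon into one partially known sequence.

    observed: (T_obs, J, 3) array of 3D joint positions (hypothetical layout).
    obs_mask: (T_obs, J) boolean array, True where a joint was actually observed.
    horizon:  number of future frames to forecast.
    Returns the full sequence and a mask marking which entries are known.
    """
    t_obs, n_joints, dim = observed.shape
    total = t_obs + horizon

    # Known-entry mask over the whole sequence: every future frame is missing.
    mask = np.zeros((total, n_joints), dtype=bool)
    mask[:t_obs] = obs_mask

    # Missing entries (occluded joints and all future frames) start as noise.
    sequence = rng.standard_normal((total, n_joints, dim))

    # Known entries are copied in; a denoiser would be conditioned on them.
    sequence[:t_obs][obs_mask] = observed[obs_mask]
    return sequence, mask
```

In this sketch, an occluded joint in the observation and an unseen future frame are represented the same way, as `False` entries in the mask, which is what lets a single denoising model both repair inputs and produce forecasts.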