Human pose forecasting is inherently multimodal since multiple futures exist for an observed pose sequence. However, evaluating multimodality is challenging since the task is ill-posed. Therefore, we first propose an alternative paradigm to make the task well-posed. Next, while state-of-the-art methods do predict multimodality, they require oversampling a large volume of predictions. This raises two key questions: (1) Can we capture multimodality by efficiently sampling a smaller number of predictions? (2) Subsequently, which of the predicted futures is more likely for an observed pose sequence? We address these questions with MotionMap, a simple yet effective heatmap-based representation for multimodality. We extend heatmaps to represent a spatial distribution over the space of all possible motions, where different local maxima correspond to different forecasts for a given observation. MotionMap can capture a variable number of modes per observation and provide confidence measures for the different modes. Further, MotionMap allows us to introduce notions of uncertainty and controllability for the forecasted pose sequence. Finally, MotionMap captures rare modes that are non-trivial to evaluate yet critical for safety. We support our claims through multiple qualitative and quantitative experiments on popular 3D human pose datasets, Human3.6M and AMASS, highlighting the strengths and limitations of our proposed method. Project Page: https://vita-epfl.github.io/MotionMap
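To make the core idea concrete, the following is a minimal illustrative sketch (not the authors' implementation) of how modes could be read off a 2D MotionMap-style heatmap: local maxima above a threshold are treated as candidate future motions, and their heatmap values as confidence scores. The function name `extract_modes` and the `threshold` and `window` parameters are assumptions introduced only for this example.

```python
# Illustrative sketch only: reading modes from a 2D heatmap by locating
# local maxima above a confidence threshold. Not the paper's actual code.
import numpy as np
from scipy.ndimage import maximum_filter

def extract_modes(heatmap: np.ndarray, threshold: float = 0.1, window: int = 5):
    """Return (row, col) coordinates and confidence scores of local maxima.

    Each local maximum is interpreted as one candidate future motion;
    its heatmap value serves as a confidence measure for that mode.
    """
    # A pixel is a local maximum if it equals the maximum over its neighborhood.
    local_max = heatmap == maximum_filter(heatmap, size=window)
    modes = np.argwhere(local_max & (heatmap > threshold))
    confidences = heatmap[modes[:, 0], modes[:, 1]]
    # Sort modes from most to least likely.
    order = np.argsort(-confidences)
    return modes[order], confidences[order]
```

Under this reading, the number of returned modes varies per observation (however many local maxima clear the threshold), which mirrors the abstract's claim that MotionMap captures a variable number of modes with associated confidences.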