Temporal modeling is crucial for multi-frame human pose estimation. Most existing methods directly employ optical flow or deformable convolution to predict full-spectrum motion fields, which may introduce numerous irrelevant cues, such as a nearby person or the background. Without further effort to extract meaningful motion priors, their results are suboptimal, especially under complicated spatiotemporal interactions. Temporal differences, on the other hand, can encode representative motion information that is potentially valuable for pose estimation but has not been fully exploited. In this paper, we present a novel multi-frame human pose estimation framework that employs temporal differences across frames to model dynamic contexts and uses a mutual information objective to facilitate the disentanglement of useful motion information. Specifically, we design a multi-stage Temporal Difference Encoder that performs incremental cascaded learning conditioned on multi-stage feature difference sequences to derive an informative motion representation. We further propose a Representation Disentanglement module from the mutual information perspective, which captures discriminative task-relevant motion signals by explicitly defining the useful and noisy constituents of the raw motion features and minimizing their mutual information. With these designs, our method ranks No. 1 in the Crowd Pose Estimation in Complex Events Challenge on the benchmark dataset HiEve and achieves state-of-the-art performance on three benchmarks: PoseTrack2017, PoseTrack2018, and PoseTrack21.
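To make the notion of a "multi-stage feature difference sequence" concrete, the following is a minimal sketch and not the framework's implementation. It assumes each frame's feature map at a given stage has been flattened into a list of floats, and that differences are taken between a key frame and every supporting frame; both choices, along with the names `temporal_differences` and `multi_stage_differences`, are illustrative assumptions. A real system would operate on tensors and feed these sequences into a learned encoder.

```python
from typing import List, Sequence

# Assumption: one frame's feature map at one stage, flattened to floats.
Frame = Sequence[float]


def temporal_differences(frames: Sequence[Frame], key_index: int) -> List[List[float]]:
    """Return key-frame-minus-supporting-frame differences.

    One difference vector is produced per supporting frame, in the
    original frame order, skipping the key frame itself.
    """
    if not 0 <= key_index < len(frames):
        raise IndexError("key_index out of range")
    key = frames[key_index]
    diffs: List[List[float]] = []
    for i, frame in enumerate(frames):
        if i == key_index:
            continue
        if len(frame) != len(key):
            raise ValueError("all frames must have the same feature length")
        diffs.append([k - f for k, f in zip(key, frame)])
    return diffs


def multi_stage_differences(
    stage_features: Sequence[Sequence[Frame]], key_index: int
) -> List[List[List[float]]]:
    """Apply temporal_differences independently at every feature stage."""
    return [temporal_differences(frames, key_index) for frames in stage_features]


if __name__ == "__main__":
    # Hypothetical toy features: three frames, two values each, key frame in the middle.
    stage_a = [[1.0, 2.0], [3.0, 5.0], [6.0, 9.0]]
    print(temporal_differences(stage_a, key_index=1))
```

Static content shared by the key frame and a supporting frame cancels in the subtraction, which is the intuition behind treating differences as a compact motion cue; separating task-relevant motion from noise in those differences is the role the abstract assigns to the Representation Disentanglement module.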