Multi-person pose tracking is an important component of many applications and requires estimating the poses of all persons in a video and tracking them over time. The association of poses across frames remains an open research problem, in particular for online tracking methods, due to motion blur, crowded scenes, and occlusions. To tackle the association challenge, we propose a Gated Attention Transformer. The core of our model is a gating mechanism that automatically adapts the influence of appearance embeddings and of embeddings based on temporal pose similarity in the attention layers. To re-identify persons that have been occluded, we incorporate a pose-conditioned re-identification network that provides initial embeddings and allows persons to be matched even if the number of visible joints differs between frames. We further propose a matching layer based on gated attention for pose-to-track association and duplicate removal. We evaluate our approach on PoseTrack 2018 and PoseTrack21.
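To illustrate the gating idea described above, the following is a minimal NumPy sketch of a single attention step in which a learned gate mixes appearance-based attention logits with a temporal pose-similarity term. It is not the model's actual layer: the function name `gated_attention`, the per-query scalar gate, the convex combination of the two logit sources, and the parameters `w_gate` and `b_gate` are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_attention(queries, keys, values, pose_sim, w_gate, b_gate=0.0):
    """Hypothetical gated attention step (illustrative assumption).

    queries:  (P, D) appearance embeddings of P poses in the current frame
    keys:     (T, D) appearance embeddings of T existing tracks
    values:   (T, D) track features to aggregate
    pose_sim: (P, T) temporal pose similarity between poses and tracks
    w_gate:   (D,)   assumed gate weights; b_gate is an assumed gate bias
    """
    d = queries.shape[-1]
    # Standard scaled dot-product logits from appearance embeddings.
    appearance_logits = queries @ keys.T / np.sqrt(d)
    # One gate value in (0, 1) per query: near 1 favors appearance,
    # near 0 favors temporal pose similarity.
    gate = sigmoid(queries @ w_gate + b_gate)[:, None]
    logits = gate * appearance_logits + (1.0 - gate) * pose_sim
    weights = softmax(logits, axis=-1)
    return weights @ values, weights, gate


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(3, 8))      # 3 detected poses
    k = rng.normal(size=(4, 8))      # 4 tracks
    v = rng.normal(size=(4, 8))
    sim = rng.uniform(size=(3, 4))   # placeholder pose similarities
    out, weights, gate = gated_attention(q, k, v, sim, rng.normal(size=8))
    print(out.shape, weights.shape, gate.shape)
```

Under this assumed formulation, a gate near 0 makes the attention weights depend almost entirely on pose similarity, which is the behavior one would want when appearance is unreliable, for example under motion blur; a gate near 1 recovers ordinary appearance-based attention.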