Multi-person pose tracking is an important component of many applications and requires estimating the poses of all persons in a video and tracking them over time. The association of poses across frames remains an open research problem, in particular for online tracking methods, due to motion blur, crowded scenes and occlusions. To tackle the association challenge, we propose a Dual-Source Attention Transformer that incorporates three core aspects: i) To re-identify persons that have been occluded, we propose a pose-conditioned re-identification network that provides an initial embedding and allows persons to be matched even if the number of visible joints differs between frames. ii) We incorporate edge embeddings based on temporal pose similarity, and the relative impact of appearance and pose similarity is adapted automatically. iii) We propose an attention-based matching layer for pose-to-track association and duplicate removal. We evaluate our approach on Market1501, PoseTrack 2018 and PoseTrack21.
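To make the association problem concrete, the following is a minimal sketch of pose-to-track matching that combines an appearance cue with a pose cue. It is not the proposed Dual-Source Attention Transformer: the learned attention-based matching layer is replaced here by a simple mutual-best-match rule, the automatically adapted weighting is replaced by a fixed hypothetical weight `alpha`, and all function names, the `scale` parameter and the `min_score` threshold are assumptions made for illustration. The pose similarity is computed only over joints visible in both the detection and the track, which is one simple way to compare poses whose number of visible joints differs.

```python
import numpy as np


def cosine_similarity(a, b):
    """Pairwise cosine similarity between rows of a (N, D) and b (M, D)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def pose_similarity(poses, track_poses, vis, track_vis, scale=1.0):
    """Pose similarity restricted to joints visible in both poses.

    poses: (N, J, 2), track_poses: (M, J, 2), vis: (N, J) bool, track_vis: (M, J) bool.
    `scale` is a hypothetical distance scale in pixels.
    """
    both = vis[:, None, :] & track_vis[None, :, :]                       # (N, M, J)
    dist = np.linalg.norm(poses[:, None] - track_poses[None], axis=-1)   # (N, M, J)
    count = both.sum(axis=-1)                                            # (N, M)
    mean_dist = (dist * both).sum(axis=-1) / np.maximum(count, 1)
    sim = np.exp(-mean_dist / scale)
    # Pairs without any jointly visible joint carry no pose evidence.
    return np.where(count > 0, sim, 0.0)


def associate(app_sim, pose_sim, alpha=0.5, min_score=0.5):
    """Mutual-best matching on a fixed blend of both cues (a stand-in, not the paper's layer)."""
    score = alpha * app_sim + (1.0 - alpha) * pose_sim
    if score.size == 0:
        return []
    best_track = score.argmax(axis=1)  # best track for each detection
    best_det = score.argmax(axis=0)    # best detection for each track
    return [
        (d, int(t))
        for d, t in enumerate(best_track)
        if best_det[t] == d and score[d, t] >= min_score
    ]


# Toy example: two tracks, two detections in swapped order.
track_emb = np.array([[0.0, 1.0], [1.0, 0.0]])
det_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
track_poses = np.array(
    [[[10.0, 10.0], [11.0, 10.0], [10.0, 12.0]],
     [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]]
)
det_poses = track_poses[[1, 0]].copy()
track_vis = np.ones((2, 3), dtype=bool)
det_vis = np.array([[True, True, False], [True, True, True]])  # one joint occluded

matches = associate(
    cosine_similarity(det_emb, track_emb),
    pose_similarity(det_poses, track_poses, det_vis, track_vis),
)
print(matches)  # [(0, 1), (1, 0)]
```

Detections that end up without a match under this rule would, in an online tracker, typically start new tracks or be discarded as duplicates; how those cases are handled in the proposed method is not specified in this abstract.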