We introduce STEP, a novel framework utilizing Transformer-based discriminative model prediction for simultaneous tracking and estimation of pose across diverse animal species and humans. We are inspired by the fact that the human brain exploits spatiotemporal continuity and performs concurrent localization and pose estimation despite the specialization of brain areas for form and motion processing. Traditional discriminative models typically require predefined target states to determine model weights, a challenge we address through the Gaussian Map Soft Prediction (GMSP) and Offset Map Regression Adapter (OMRA) modules. These modules remove the need for keypoint target states as input, streamlining the process. Our method starts from a target state obtained in the first frame of a given video sequence, either from a pre-trained detector or by manual initialization. It then seamlessly tracks the target and estimates anatomically important keypoints for subsequent frames. Unlike prevalent top-down pose estimation methods, our approach does not rely on per-frame target detections, owing to its tracking capability. This substantially improves inference efficiency and broadens potential applications. We train and validate our approach on datasets encompassing diverse species. Our experiments demonstrate superior results compared to existing methods, paving the way for applications such as action recognition and behavioral analysis.
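The single-initialization pipeline described above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's actual implementation: all names (`TargetState`, `StepTracker`, `run_sequence`) and the placeholder jitter standing in for the learned discriminative model are assumptions introduced here for clarity. The point it demonstrates is structural: the detector (or manual annotation) is consulted only on the first frame, after which tracking and pose estimation proceed jointly without per-frame detection.

```python
# Hypothetical sketch of the single-initialization tracking-and-pose loop.
# All class and function names here are illustrative, not the paper's API.
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TargetState:
    box: Tuple[float, float, float, float]  # target bounding box (x, y, w, h)
    keypoints: List[Tuple[float, float]]    # anatomically important keypoints


class StepTracker:
    """Stand-in for the discriminative model: initialized once with a target
    state, it then predicts box + keypoints for every later frame, so no
    per-frame detector is required."""

    def __init__(self, first_frame, init_state: TargetState):
        # A real model would fit its weights to the first frame here.
        self.state = init_state

    def track(self, frame) -> TargetState:
        # A real model would regress Gaussian maps (GMSP) and offset maps
        # (OMRA) from the frame; we jitter the previous state instead,
        # purely to keep the sketch runnable.
        x, y, w, h = self.state.box
        self.state = TargetState(
            box=(x + random.uniform(-1, 1), y + random.uniform(-1, 1), w, h),
            keypoints=[(kx + random.uniform(-1, 1), ky + random.uniform(-1, 1))
                       for kx, ky in self.state.keypoints],
        )
        return self.state


def run_sequence(frames) -> List[TargetState]:
    # Initialize once on the first frame (pre-trained detector or manual)...
    init = TargetState(box=(50.0, 40.0, 30.0, 60.0),
                       keypoints=[(55.0, 45.0), (60.0, 80.0)])
    tracker = StepTracker(frames[0], init)
    # ...then jointly track and estimate pose for all subsequent frames.
    return [tracker.track(f) for f in frames[1:]]


# Integers stand in for image frames in this sketch.
states = run_sequence(frames=list(range(10)))
print(len(states))  # one predicted state per subsequent frame
```

Because the tracker carries its state forward, inference cost per frame is a single model evaluation rather than detection plus pose estimation, which is the efficiency gain the abstract claims over top-down methods.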