Frequent interactions between individuals are a fundamental challenge for pose estimation algorithms. Current pipelines either use an object detector together with a pose estimator (top-down approach), or localize all body parts first and then link them to predict the pose of individuals (bottom-up approach). Yet, when individuals closely interact, top-down methods are ill-defined because the individuals overlap, and bottom-up methods often falsely infer connections to distant body parts. Thus, we propose a novel pipeline called bottom-up conditioned top-down pose estimation (BUCTD) that combines the strengths of bottom-up and top-down methods. Specifically, we propose to use a bottom-up model as the detector, which, in addition to an estimated bounding box, provides a pose proposal that is fed as a condition to an attention-based top-down model. We demonstrate the performance and efficiency of our approach on animal and human pose estimation benchmarks. On CrowdPose and OCHuman, we outperform previous state-of-the-art models by a significant margin. We achieve 78.5 AP on CrowdPose and 48.5 AP on OCHuman, improvements of 8.6% and 7.8% over the prior art, respectively. Furthermore, we show that our method strongly improves performance on multi-animal benchmarks involving fish and monkeys. The code is available at https://github.com/amathislab/BUCTD
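The detector stage described above, in which a bottom-up model supplies both a bounding box and a pose proposal per individual, can be illustrated with a minimal sketch. This is not the BUCTD implementation: the function names `bbox_from_pose` and `conditioned_inputs`, the `(x, y, confidence)` keypoint layout, and the `margin` and `min_conf` parameters are all assumptions made for illustration. The sketch only shows how a box might be derived from predicted keypoints and paired with the pose that would condition a top-down model.

```python
from typing import List, Tuple

# Assumed layout (not from the paper): one keypoint is (x, y, confidence).
Keypoint = Tuple[float, float, float]
# Assumed box layout: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def bbox_from_pose(pose: List[Keypoint], margin: float = 0.1,
                   min_conf: float = 0.2) -> Box:
    """Derive a padded bounding box from one bottom-up pose proposal.

    Keypoints below `min_conf` are ignored so that a spurious, distant
    detection does not inflate the box. Both thresholds are hypothetical.
    """
    visible = [(x, y) for x, y, c in pose if c >= min_conf]
    if not visible:
        raise ValueError("no keypoint above the confidence threshold")
    xs = [x for x, _ in visible]
    ys = [y for _, y in visible]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    # Pad each side by a fraction of the box extent.
    pad_x = margin * (x_max - x_min)
    pad_y = margin * (y_max - y_min)
    return (x_min - pad_x, y_min - pad_y, x_max + pad_x, y_max + pad_y)


def conditioned_inputs(poses: List[List[Keypoint]],
                       margin: float = 0.1) -> List[Tuple[Box, List[Keypoint]]]:
    """Pair every pose proposal with its derived box.

    Each (box, pose) pair stands in for what a conditioned top-down model
    would receive: the box selects the image crop, the pose is the condition.
    """
    return [(bbox_from_pose(pose, margin), pose) for pose in poses]


if __name__ == "__main__":
    proposals = [
        [(10.0, 20.0, 0.9), (30.0, 60.0, 0.8), (500.0, 500.0, 0.05)],
        [(100.0, 100.0, 0.7), (140.0, 180.0, 0.6)],
    ]
    for box, pose in conditioned_inputs(proposals):
        print(box, len(pose))
```

Keeping the pose attached to its box is the point of the sketch: when two individuals overlap, their boxes may be nearly identical, and the accompanying pose proposal is what tells the second stage which individual to estimate.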