Single-stage multi-person human pose estimation (MPPE) methods have improved substantially, but existing methods fail to disentangle the features of individual instances in crowded scenes. In this paper, we propose BoIR, a bounding box-level instance representation learning method that simultaneously addresses instance detection, instance disentanglement, and instance-keypoint association. Our new instance embedding loss uses bounding box annotations to provide a learning signal over the entire image area, yielding a globally consistent and disentangled instance representation. Our method exploits multi-task learning of bottom-up keypoint estimation, bounding box regression, and contrastive instance embedding learning, without additional computational cost during inference. BoIR is effective for crowded scenes, outperforming the state of the art by 0.8 AP on COCO val, 0.5 AP on COCO test-dev, 4.9 AP on CrowdPose, and 3.5 AP on OCHuman. Code will be available at https://github.com/uyoung-jeong/BoIR
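To make the idea of contrastive instance embedding learning from bounding box annotations more concrete, the following is a minimal sketch and not the BoIR loss itself, whose exact form is not given in this abstract. It assumes a generic InfoNCE-style objective: each box's mean embedding serves as an instance prototype, and every pixel inside a box is pulled toward its own prototype and pushed away from the others. The function names, the end-exclusive `(x0, y0, x1, y1)` box format, and the `temperature` parameter are all hypothetical choices made for illustration.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(c * c for c in vec)) or 1.0
    return [c / norm for c in vec]


def box_prototype(emb, box):
    """Mean embedding over the pixels inside box = (x0, y0, x1, y1), end-exclusive."""
    x0, y0, x1, y1 = box
    dim = len(emb[0][0])
    total = [0.0] * dim
    count = 0
    for y in range(y0, y1):
        for x in range(x0, x1):
            for d in range(dim):
                total[d] += emb[y][x][d]
            count += 1
    return [t / count for t in total]


def contrastive_box_loss(emb, boxes, temperature=0.1):
    """Illustrative InfoNCE-style loss (an assumption, not the paper's formula).

    emb is an H x W x D nested list of per-pixel embeddings. Each pixel in
    box i is classified against all box prototypes with target class i.
    """
    protos = [normalize(box_prototype(emb, b)) for b in boxes]
    total, count = 0.0, 0
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        for y in range(y0, y1):
            for x in range(x0, x1):
                pixel = normalize(emb[y][x])
                logits = [
                    sum(a * b for a, b in zip(pixel, p)) / temperature
                    for p in protos
                ]
                # Numerically stable log-sum-exp over all prototypes.
                m = max(logits)
                log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
                total += log_denom - logits[i]
                count += 1
    return total / count


# Toy 2 x 4 embedding maps with two side-by-side boxes.
boxes = [(0, 0, 2, 2), (2, 0, 4, 2)]
separated = [[[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]] for _ in range(2)]
entangled = [[[1.0, 0.0] for _ in range(4)] for _ in range(2)]

loss_separated = contrastive_box_loss(separated, boxes)
loss_entangled = contrastive_box_loss(entangled, boxes)
print(loss_separated < loss_entangled)
```

In this toy setting the loss is lower when the two boxes carry distinct embeddings than when every pixel shares the same embedding, which is the qualitative behavior a disentangling objective is meant to reward. A practical implementation would operate on batched tensors in a deep learning framework and would need a policy for pixels covered by overlapping boxes, which this sketch simply counts once per box.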