Accurate 3D human pose estimation (3D HPE) is crucial for enabling autonomous vehicles (AVs) to make informed decisions and respond proactively in critical road scenarios. 3D HPE has shown promising results in several domains, such as human-computer interaction, robotics, sports and medical analytics, often based on data collected in well-controlled laboratory environments. Nevertheless, the transfer of 3D HPE methods to AVs has received limited research attention, owing to the difficulty of obtaining accurate 3D pose annotations and the limited suitability of data from other domains. We present a simple yet efficient weakly supervised approach for 3D HPE in the AV context that employs high-level sensor fusion between camera and LiDAR data. The weakly supervised setting enables training on the target datasets without any 2D/3D keypoint labels by using an off-the-shelf 2D joint extractor and pseudo labels generated from LiDAR-to-image projections. Our approach outperforms state-of-the-art results by up to $\sim$13% on the Waymo Open Dataset in the weakly supervised setting and achieves state-of-the-art results in the supervised setting.
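To make the idea of pseudo labels from LiDAR-to-image projections more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a pinhole camera with intrinsic matrix `K`, LiDAR points already transformed into a camera frame whose z-axis points forward, and a simple nearest-projected-point rule for lifting each 2D keypoint to 3D. The function names `project_points` and `pseudo_labels_from_lidar` and the `max_pixel_dist` threshold are hypothetical choices made for illustration.

```python
import numpy as np


def project_points(points_cam, K):
    """Project N x 3 camera-frame points (z forward, an assumption) to pixels.

    Returns an N x 2 array of pixel coordinates (NaN for points behind the
    camera) and a boolean mask marking the points in front of the camera.
    """
    valid = points_cam[:, 2] > 1e-6
    uv = np.full((points_cam.shape[0], 2), np.nan)
    uvw = points_cam[valid] @ K.T          # homogeneous pixel coordinates
    uv[valid] = uvw[:, :2] / uvw[:, 2:3]   # perspective division
    return uv, valid


def pseudo_labels_from_lidar(keypoints_2d, points_cam, K, max_pixel_dist=10.0):
    """Assign each 2D keypoint the 3D LiDAR point whose projection is nearest.

    Illustrative rule only: a keypoint receives a pseudo label when some
    projected LiDAR point lies within max_pixel_dist pixels of it.
    Returns a J x 3 array of pseudo labels (NaN where missing) and a J mask.
    """
    num_joints = keypoints_2d.shape[0]
    labels = np.full((num_joints, 3), np.nan)
    mask = np.zeros(num_joints, dtype=bool)

    uv, valid = project_points(points_cam, K)
    if not valid.any():
        return labels, mask

    uv_valid = uv[valid]
    pts_valid = points_cam[valid]
    for j, kp in enumerate(keypoints_2d):
        dists = np.linalg.norm(uv_valid - kp, axis=1)
        nearest = int(np.argmin(dists))
        if dists[nearest] <= max_pixel_dist:
            labels[j] = pts_valid[nearest]
            mask[j] = True
    return labels, mask
```

In a weakly supervised setup of this kind, the returned mask would typically be used to restrict the 3D loss to joints that actually received a pseudo label, since sparse LiDAR returns leave some joints uncovered.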