Accurate camera motion estimation is essential for recovering global human motion in world coordinates from RGB video inputs. SLAM is widely used for estimating camera trajectories and point clouds, but monocular SLAM does so only up to an unknown scale factor. Previous work estimates the scale factor through optimization, but this is unreliable and time-consuming. This paper presents an optimization-free scale calibration framework, Human as Checkerboard (HAC). HAC leverages the human body predicted by a human mesh recovery model as a calibration reference. Specifically, it uses the absolute depth of human-scene contact joints as references to calibrate the corresponding relative scene depth from SLAM. HAC benefits from the geometric priors encoded in human mesh recovery models to estimate the SLAM scale and achieves precise global human motion estimation. Simple yet powerful, our method sets a new state of the art for global human mesh estimation, reducing motion errors by 50% over prior local-to-global methods while requiring 100$\times$ less inference time than optimization-based methods. Project page: https://martayang.github.io/HAC.
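The calibration idea, using the absolute depths of contact joints to rescale the relative SLAM depths, can be illustrated with a minimal sketch. This is not the HAC implementation: the function names `estimate_scale` and `rescale_translations` are hypothetical, the inputs are assumed to be already-paired depths sampled at the same image locations, and aggregating the per-joint ratios with a median is an assumption chosen for robustness to outliers.

```python
from statistics import median


def estimate_scale(human_depths, slam_depths, eps=1e-6):
    """Estimate a metric-per-SLAM-unit scale from paired depths.

    human_depths: absolute depths (meters) of human-scene contact joints,
                  e.g. taken from a human mesh recovery model (assumed input).
    slam_depths:  up-to-scale SLAM depths at the same image locations
                  (assumed input).
    """
    # One scale hypothesis per contact joint: absolute depth / relative depth.
    ratios = [
        h / s
        for h, s in zip(human_depths, slam_depths)
        if h > eps and s > eps  # skip invalid or degenerate depths
    ]
    if not ratios:
        raise ValueError("no valid contact-depth pairs")
    # Median aggregation is an assumption, not necessarily what HAC does.
    return median(ratios)


def rescale_translations(translations, scale):
    """Apply the estimated scale to up-to-scale camera translations."""
    return [tuple(scale * c for c in t) for t in translations]


# Toy data: three consistent pairs (ratio 2.0) and one outlier (ratio 10.0).
human = [2.0, 4.0, 3.0, 10.0]
slam = [1.0, 2.0, 1.5, 1.0]
scale = estimate_scale(human, slam)
metric_translations = rescale_translations([(0.5, 0.0, 1.0)], scale)
```

Because the scale is computed in closed form from depth ratios, no iterative optimization is involved, which is the property the abstract emphasizes.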