Past work has proposed various heuristic objectives for modeling hand-object interaction. However, lacking a cohesive framework, these objectives often have a narrow scope of applicability and are limited in efficiency or accuracy. In this paper, we propose HandyPriors, a unified and general pipeline for pose estimation in human-object interaction scenes that leverages recent advances in differentiable physics and rendering. Our approach employs rendering priors to align with input images and segmentation masks, along with physics priors to mitigate penetration and relative sliding across frames. Furthermore, we present two alternatives for hand and object pose estimation: optimization-based pose estimation, which achieves higher accuracy, and filtering-based tracking, which uses the differentiable priors as dynamics and observation models and executes faster. We demonstrate that HandyPriors attains comparable or superior results in the pose estimation task, and that the differentiable physics module can predict contact information for pose refinement. We also show that our approach generalizes to perception tasks, including robotic hand manipulation and human-object pose estimation in the wild.
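To make the idea of combining an image-alignment term with a penetration term in one differentiable objective more concrete, the following is a minimal toy sketch, not the HandyPriors implementation. Everything in it is an assumption made for illustration: the object is a unit sphere at the origin, the hand is reduced to two fingertip points with a single free translation `t`, the "rendering" term is a squared 2D reprojection error under orthographic projection, the "physics" term is a squared hinge on penetration depth, and gradients are derived by hand rather than obtained from a differentiable renderer or simulator.

```python
import math

# Hypothetical toy setup: a sphere "object" at the origin and two fingertip
# points rigidly attached to a hand whose only free parameter is a translation t.
RADIUS = 1.0
FINGERTIPS = [(0.0, 0.0, 0.0), (0.2, 0.0, 0.0)]   # hand-frame points (assumed)
OBSERVED_2D = [(0.0, 0.0), (0.2, 0.0)]            # stand-in for image evidence
W_OBS, W_PEN = 1.0, 1.0                           # term weights (assumed)


def loss_and_grad(t):
    """Return (loss, gradient) of the combined objective at translation t."""
    loss = 0.0
    grad = [0.0, 0.0, 0.0]
    for (px, py, pz), (ux, uy) in zip(FINGERTIPS, OBSERVED_2D):
        q = (px + t[0], py + t[1], pz + t[2])
        # Observation term: orthographic projection (drop z) vs. observed 2D point.
        rx, ry = q[0] - ux, q[1] - uy
        loss += W_OBS * (rx * rx + ry * ry)
        grad[0] += 2.0 * W_OBS * rx
        grad[1] += 2.0 * W_OBS * ry
        # Penetration term: squared hinge on how deep the point is inside the sphere.
        d = math.sqrt(q[0] ** 2 + q[1] ** 2 + q[2] ** 2)
        depth = RADIUS - d
        if depth > 0.0 and d > 1e-12:
            loss += W_PEN * depth * depth
            for k in range(3):
                grad[k] += -2.0 * W_PEN * depth * q[k] / d
    return loss, grad


def optimize(t0, lr=0.1, steps=200):
    """Plain gradient descent on the combined objective."""
    t = list(t0)
    for _ in range(steps):
        _, g = loss_and_grad(t)
        t = [ti - lr * gi for ti, gi in zip(t, g)]
    return t


t_init = [0.1, -0.1, 0.6]          # starts misaligned and inside the sphere
t_final = optimize(t_init)
loss_init, _ = loss_and_grad(t_init)
loss_final, _ = loss_and_grad(t_final)
print("initial loss:", loss_init)
print("final translation:", t_final, "final loss:", loss_final)
```

In this toy problem the 2D term alone cannot determine depth along the viewing axis, so it is the penetration term that pushes the fingertips out to the sphere surface. This is only a small-scale analogue of the complementary roles the abstract assigns to rendering and physics priors.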