Past work has proposed various heuristic objectives for modeling hand-object interaction. However, for lack of a cohesive framework, these objectives tend to apply only in narrow settings and are limited in efficiency or accuracy. In this paper, we propose HandyPriors, a unified and general pipeline for pose estimation in human-object interaction scenes that leverages recent advances in differentiable physics and rendering. Our approach uses rendering priors to align the estimated poses with input images and segmentation masks, and physics priors to mitigate penetration and relative sliding across frames. Furthermore, we present two alternatives for hand and object pose estimation: optimization-based pose estimation achieves higher accuracy, while filtering-based tracking, which uses the differentiable priors as dynamics and observation models, runs faster. We demonstrate that HandyPriors attains comparable or superior results in the pose estimation task, and that the differentiable physics module can predict contact information for pose refinement. We also show that our approach generalizes to perception tasks, including robotic hand manipulation and human-object pose estimation in the wild.
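To make the idea of combining an image-alignment term with a physics term concrete, here is a minimal toy sketch. It is not the HandyPriors pipeline. It assumes a sphere as the object, a single fingertip point as the hand, a squared distance to an observed center as a stand-in for the rendering prior, and a squared penetration depth as a stand-in for the physics prior. All names (`penetration_depth`, `total_loss`, `refine_pose`, `w_phys`) are hypothetical, and central finite differences replace the automatic differentiation that a differentiable renderer or simulator would provide.

```python
import math


def penetration_depth(center, radius, point):
    """Depth by which `point` lies inside the sphere (0.0 if outside)."""
    return max(0.0, radius - math.dist(center, point))


def total_loss(center, observed_center, radius, fingertips, w_phys=10.0):
    """Alignment term (stand-in for a rendering prior) plus a weighted
    penetration penalty (stand-in for a physics prior)."""
    align = sum((c - o) ** 2 for c, o in zip(center, observed_center))
    phys = sum(penetration_depth(center, radius, p) ** 2 for p in fingertips)
    return align + w_phys * phys


def numerical_grad(f, x, eps=1e-6):
    """Central finite differences; a placeholder for autodiff."""
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        grad.append((f(xp) - f(xm)) / (2 * eps))
    return grad


def refine_pose(observed_center, radius, fingertips,
                steps=500, lr=0.02, w_phys=10.0):
    """Gradient descent on the object center, starting from the observation."""
    center = list(observed_center)

    def loss(c):
        return total_loss(c, observed_center, radius, fingertips, w_phys)

    for _ in range(steps):
        g = numerical_grad(loss, center)
        center = [c - lr * gi for c, gi in zip(center, g)]
    return center


# Toy scene: the observed sphere center puts the fingertip 0.2 units inside.
fingertips = [(0.8, 0.0, 0.0)]
observed = (0.0, 0.0, 0.0)
refined = refine_pose(observed, 1.0, fingertips)
print("penetration before:", penetration_depth(observed, 1.0, fingertips[0]))
print("penetration after: ", penetration_depth(refined, 1.0, fingertips[0]))
```

In this toy scene the refined center shifts away from the fingertip, trading a small alignment error for a much smaller penetration depth. The weight `w_phys` sets that trade-off and is an arbitrary choice here.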