Past work has proposed various heuristic objectives for modeling hand-object interaction. Lacking a cohesive framework, however, these objectives often apply only in narrow settings and are limited in efficiency or accuracy. In this paper, we propose HandyPriors, a unified and general pipeline for pose estimation in human-object interaction scenes that leverages recent advances in differentiable physics and rendering. Our approach employs rendering priors to align the estimated poses with input images and segmentation masks, and physics priors to mitigate penetration and relative sliding across frames. Furthermore, we present two alternatives for hand and object pose estimation. The optimization-based pose estimation achieves higher accuracy, while the filtering-based tracking, which uses the differentiable priors as dynamics and observation models, runs faster. We demonstrate that HandyPriors attains comparable or superior results in the pose estimation task, and that the differentiable physics module can predict contact information for pose refinement. We also show that our approach generalizes to other perception tasks, including robotic hand manipulation and human-object pose estimation in the wild.
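To make the optimization-based variant concrete, the following is a minimal illustrative sketch, not the HandyPriors implementation. It assumes a toy setting: the object is a sphere, the hand is a small set of 3D points, and only a hand translation is optimized. A point-to-point least-squares term stands in for the rendering prior, and a squared penetration-depth penalty stands in for the physics prior. All function names and parameters (`penetration_penalty`, `alignment_loss`, `refine_translation`, `w_phys`) are hypothetical.

```python
import numpy as np


def penetration_penalty(points, center, radius):
    """Sum of squared penetration depths of points inside a sphere.

    Toy stand-in for a physics prior. Returns (loss, gradient w.r.t. points).
    """
    diff = points - center
    dist = np.linalg.norm(diff, axis=1)
    depth = np.maximum(0.0, radius - dist)  # zero for points outside the sphere
    loss = np.sum(depth ** 2)
    safe_dist = np.maximum(dist, 1e-9)  # avoid division by zero at the center
    grad = -2.0 * depth[:, None] * diff / safe_dist[:, None]
    return loss, grad


def alignment_loss(points, observed):
    """Squared distance to observed points.

    Toy stand-in for an image or mask alignment term.
    Returns (loss, gradient w.r.t. points).
    """
    resid = points - observed
    return np.sum(resid ** 2), 2.0 * resid


def refine_translation(template, observed, center, radius,
                       w_phys=10.0, lr=0.01, steps=500):
    """Gradient descent on a hand translation t under the combined objective."""
    t = np.zeros(3)
    for _ in range(steps):
        pts = template + t
        _, g_obs = alignment_loss(pts, observed)
        _, g_pen = penetration_penalty(pts, center, radius)
        # Every point shifts rigidly with t, so per-point gradients are summed.
        t -= lr * (g_obs + w_phys * g_pen).sum(axis=0)
    return t


# Hypothetical data: noisy observations place the hand points inside the object.
center, radius = np.zeros(3), 1.0
template = np.array([[1.2, 0.0, 0.0], [1.2, 0.1, 0.0], [1.2, -0.1, 0.0]])
observed = template + np.array([-0.4, 0.0, 0.0])

t_free = refine_translation(template, observed, center, radius, w_phys=0.0)
t_phys = refine_translation(template, observed, center, radius, w_phys=10.0)
```

With `w_phys=0.0` the estimate follows the observations into the object. With the penalty enabled, the estimate settles near the surface, trading a small alignment error for a much smaller penetration. A fuller treatment would optimize articulated hand and rigid object poses, and would add a term penalizing relative sliding across frames.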