This paper addresses the task of 3D pose estimation for a hand interacting with an object from a single image observation. When modeling hand-object interaction, previous works mainly exploit proximity cues and overlook the dynamical requirement that the hand must stably grasp the object to counteract gravity and thus prevent it from slipping or falling. Because these works do not leverage dynamical constraints in the estimation, they often produce unstable results. Meanwhile, refining unstable configurations with physics-based reasoning remains challenging, owing both to the complexity of contact dynamics and to the lack of effective and efficient physics inference in the data-driven learning framework. To address both issues, we present DeepSimHO: a novel deep-learning pipeline that combines forward physics simulation and backward gradient approximation with a neural network. Specifically, given an initial hand-object pose estimated by a base network, we forward it to a physics simulator to evaluate its stability. However, due to non-smooth contact geometry and penetration, existing differentiable simulators cannot provide reliable state gradients. To remedy this, we further introduce a deep network that learns the stability evaluation process from the simulator while smoothly approximating its gradient, thus enabling effective back-propagation. Extensive experiments show that our method noticeably improves the stability of the estimation and achieves superior efficiency over test-time optimization. The code is available at https://github.com/rongakowang/DeepSimHO.
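To make the idea of backward gradient approximation concrete, the following is a minimal sketch under strong simplifying assumptions, not the DeepSimHO implementation. The physics simulator is replaced by a hypothetical piecewise-constant function `toy_sim` over a 2-D "pose", and the deep network is replaced by a tiny one-hidden-layer surrogate whose random hidden weights are fixed and whose output layer is fitted by ridge regression. All names (`toy_sim`, `GOAL`, `surrogate_grad`) and constants are illustrative choices. The sketch shows that the simulator's own finite-difference gradient vanishes, while the gradient of the smooth surrogate can still drive the pose toward a more stable configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
GOAL = np.array([0.5, -0.5])  # hypothetical "stable" pose in a 2-D toy pose space

def toy_sim(x):
    """Stand-in for a physics simulator: a piecewise-constant instability score."""
    return np.floor(4.0 * np.sum((x - GOAL) ** 2, axis=-1)) / 4.0

def finite_diff_grad(f, x, eps=1e-4):
    """Central finite differences; uninformative on a piecewise-constant score."""
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

# 1) Forward pass: query the (non-differentiable) simulator on sampled poses.
X = rng.uniform(-2.0, 2.0, size=(2000, 2))
y = toy_sim(X)

# 2) Fit a smooth surrogate: fixed random tanh features plus a ridge-regressed
#    output layer (an assumption made for brevity, not a trained deep network).
W1 = rng.normal(size=(2, 64))
b1 = rng.normal(size=64)
H = np.hstack([np.tanh(X @ W1 + b1), np.ones((len(X), 1))])  # append bias column
w2 = np.linalg.solve(H.T @ H + 1e-3 * np.eye(H.shape[1]), H.T @ y)

def surrogate(x):
    """Smooth approximation of toy_sim for a single pose x of shape (2,)."""
    return np.tanh(x @ W1 + b1) @ w2[:-1] + w2[-1]

def surrogate_grad(x):
    """Analytic gradient of the surrogate with respect to the pose x."""
    h = np.tanh(x @ W1 + b1)
    return W1 @ (w2[:-1] * (1.0 - h ** 2))

# 3) Backward pass: refine an initial pose using the surrogate's gradient.
x0 = np.array([-1.0, 1.2])
x = x0.copy()
for _ in range(200):
    x = np.clip(x - 0.05 * surrogate_grad(x), -2.0, 2.0)
x_refined = x

print("simulator finite-difference gradient at x0:", finite_diff_grad(toy_sim, x0))
print("instability before / after:", toy_sim(x0), toy_sim(x_refined))
```

In the pipeline described above, the surrogate's gradient would be back-propagated into the base network during training rather than used to optimize each pose at test time; the per-pose loop here only serves to make the effect of the approximated gradient visible.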