State-of-the-art object pose estimation methods are prone to generating geometrically infeasible pose hypotheses. This problem is prevalent in dexterous manipulation, where estimated poses often intersect the robotic hand or do not rest on a support surface. We propose a multi-modal pose refinement approach that combines differentiable physics simulation, differentiable rendering, and visuo-tactile sensing to optimize object poses for both spatial accuracy and physical consistency. Simulated experiments show that our approach reduces the intersection volume error between the object and the robotic hand by 73\% when the initial estimate is accurate and by over 87\% under high initial uncertainty, significantly outperforming standard ICP-based baselines. Furthermore, the improvement in geometric plausibility is accompanied by a reduction in translation and orientation errors. Achieving pose estimation that is grounded in physical reality while remaining faithful to multi-modal sensor inputs is a critical step toward robust in-hand manipulation.
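To make the idea of optimizing a pose for both sensor agreement and physical consistency concrete, the following is a minimal toy sketch and not the method described above. It assumes a 2D scene in which the object and a single fingertip are disks, the support surface is the line y = 0, and the sensor term is a plain squared distance to an observed object centre. Hand-written analytic gradients stand in for differentiable physics and rendering, and every name and number (`R_OBJ`, `HAND`, `W_PEN`, `refine`, the step size) is a hypothetical choice made for illustration.

```python
import math

# Toy 2D scene; all values are illustrative assumptions (metres).
R_OBJ = 0.05            # object modelled as a disk of this radius
R_HAND = 0.02           # one fingertip modelled as a disk of this radius
HAND = (0.07, 0.05)     # fingertip centre
W_PEN = 50.0            # weight of the two penetration penalties


def hand_penetration(p):
    """Depth by which the object disk overlaps the fingertip disk."""
    dist = math.hypot(p[0] - HAND[0], p[1] - HAND[1])
    return max(0.0, R_OBJ + R_HAND - dist)


def floor_penetration(p):
    """Depth by which the object disk sinks below the support line y = 0."""
    return max(0.0, R_OBJ - p[1])


def loss(p, p_obs):
    """Sensor-agreement term plus soft penalties on both penetrations."""
    data = (p[0] - p_obs[0]) ** 2 + (p[1] - p_obs[1]) ** 2
    return data + W_PEN * (hand_penetration(p) ** 2 + floor_penetration(p) ** 2)


def grad(p, p_obs):
    """Analytic gradient of `loss` with respect to the object centre p."""
    gx = 2.0 * (p[0] - p_obs[0])
    gy = 2.0 * (p[1] - p_obs[1])
    d = hand_penetration(p)
    if d > 0.0:
        dx, dy = p[0] - HAND[0], p[1] - HAND[1]
        dist = math.hypot(dx, dy)
        if dist > 1e-12:  # avoid dividing by zero when centres coincide
            gx -= 2.0 * W_PEN * d * dx / dist
            gy -= 2.0 * W_PEN * d * dy / dist
    f = floor_penetration(p)
    if f > 0.0:
        gy -= 2.0 * W_PEN * f
    return gx, gy


def refine(p_obs, steps=500, lr=0.004):
    """Plain gradient descent, started from the observed (infeasible) pose."""
    p = p_obs
    for _ in range(steps):
        gx, gy = grad(p, p_obs)
        p = (p[0] - lr * gx, p[1] - lr * gy)
    return p


p_init = (0.015, 0.045)   # overlaps the fingertip and sinks into the support
p_refined = refine(p_init)
```

Because the penalties are soft, the refined pose trades a small residual overlap against staying close to the observation; a larger `W_PEN` pushes the result further toward feasibility at the cost of a stiffer optimization problem that needs a smaller step size.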