Open-vocabulary 6D object pose estimation empowers robots to manipulate arbitrary unseen objects guided solely by natural language. However, a critical limitation of existing approaches is their reliance on unconstrained global matching strategies. In open-world scenarios, matching anchor features against the entire query image introduces excessive ambiguity, as target features are easily confused with background distractors. To resolve this, we propose Fine-grained Correspondence Pose Estimation (FiCoP), a framework that transitions from noise-prone global matching to spatially constrained patch-level correspondence. Our core innovation lies in leveraging a patch-to-patch correlation matrix as a structural prior to narrow the matching scope, effectively filtering out irrelevant clutter before it can degrade pose estimation. First, we introduce an object-centric disentanglement preprocessing step to isolate the semantic target from environmental noise. Second, we propose a Cross-Perspective Global Perception (CPGP) module that fuses dual-view features, establishing structural consensus through explicit context reasoning. Finally, we design a Patch Correlation Predictor (PCP) that generates a precise block-wise association map, acting as a spatial filter to enforce fine-grained, noise-resilient matching. Experiments on the REAL275 and Toyota-Light datasets demonstrate that FiCoP improves Average Recall by 8.0% and 6.1%, respectively, over the state-of-the-art method, highlighting its capability to deliver robust and generalized perception for robotic agents operating in complex, unconstrained open-world environments. The source code will be made publicly available at https://github.com/zjjqinyu/FiCoP.