Learning Fine-Grained Correspondence with Cross-Perspective Perception for Open-Vocabulary 6D Object Pose Estimation

Source: arXiv. Accepted to IEEE Robotics and Automation Letters (RA-L). The source code will be made publicly available at https://github.com/zjjqinyu/FiCoP

Open-vocabulary 6D object pose estimation empowers robots to manipulate arbitrary unseen objects guided solely by natural language. However, a critical limitation of existing approaches is their reliance on unconstrained global matching strategies. In open-world scenarios, matching anchor features against the entire query image space introduces excessive ambiguity, because target features are easily confused with background distractors. To resolve this, we propose Fine-grained Correspondence Pose Estimation (FiCoP), a framework that transitions from noise-prone global matching to spatially constrained patch-level correspondence. To systematically eliminate background interference, FiCoP first employs an object-centric disentanglement step to isolate the target from macro-level environmental noise. Building upon this localized region, our core methodological innovations are twofold. First, a Cross-Perspective Global Perception (CPGP) module fuses dual-view features, establishing structural consensus through explicit context reasoning and text-guided semantic injection. Second, a Patch Correlation Predictor (PCP) leverages a patch-to-patch correlation matrix as a structural prior. The PCP generates a precise block-wise association map that acts as a spatial filter to enforce fine-grained, noise-resilient matching. Experiments on the REAL275 and Toyota-Light datasets demonstrate that FiCoP improves Average Recall by 8.0% and 6.1%, respectively, compared to the state-of-the-art method, highlighting its capability to deliver robust and generalized perception for robotic agents operating in complex, unconstrained open-world environments.
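
To make the idea of a patch-to-patch correlation matrix acting as a spatial filter more concrete, here is a minimal sketch in plain Python. It is not the authors' PCP, which the abstract describes as a learned predictor. As a stand-in, it assumes cosine similarity between patch features and a hard top-k selection per anchor patch. The function names (patch_correlation, association_mask, filtered_matches), the top_k parameter, and the toy feature vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def patch_correlation(anchor_patches, query_patches):
    """Build the (num_anchor x num_query) matrix of patch similarities."""
    return [[cosine(a, q) for q in query_patches] for a in anchor_patches]


def association_mask(corr, top_k=1):
    """For each anchor patch, keep only its top_k most correlated query patches.

    Assumption: a hard top-k rule replaces the learned association map.
    """
    mask = []
    for row in corr:
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        keep = set(ranked[:top_k])
        mask.append([j in keep for j in range(len(row))])
    return mask


def filtered_matches(corr, mask):
    """List (anchor_idx, query_idx, score) for pairs that survive the mask."""
    return [
        (i, j, corr[i][j])
        for i in range(len(corr))
        for j in range(len(corr[i]))
        if mask[i][j]
    ]


# Toy data: two anchor patches, three query patches (the third is a distractor
# that partially resembles both anchors).
anchor = [[1.0, 0.0], [0.0, 1.0]]
query = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]

corr = patch_correlation(anchor, query)
mask = association_mask(corr, top_k=1)
matches = filtered_matches(corr, mask)
print([(i, j) for i, j, _ in matches])
```

In this toy case the distractor patch is dropped for both anchors, so any finer matching would be restricted to the surviving patch pairs instead of the whole query image. A learned predictor would replace the fixed cosine and top-k rule with scores conditioned on the fused dual-view and text features.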
