Object pose estimation is a fundamental problem in computer vision and plays a critical role in virtual reality and embodied intelligence, where agents must understand and interact with objects in 3D space. Recently, score-based generative models have partially resolved the rotational-symmetry ambiguity in category-level pose estimation, but their efficiency remains limited by the high sampling cost of score-based diffusion. In this work, we propose RFM-Pose, a new framework that accelerates category-level 6D object pose generation while actively evaluating sampled hypotheses. To improve sampling efficiency, we adopt a flow-matching generative model and generate pose candidates along an optimal-transport path from a simple prior to the pose distribution. To further refine these candidates, we cast the flow-matching sampling process as a Markov decision process and apply proximal policy optimization to fine-tune the sampling policy. In particular, we interpret the flow field as a learnable policy and map an estimator to a value network, enabling joint optimization of pose generation and hypothesis scoring within a reinforcement learning framework. Experiments on the REAL275 benchmark demonstrate that RFM-Pose achieves favorable performance while significantly reducing computational cost. Moreover, like prior work, our approach can be readily adapted to object pose tracking and attains competitive results in that setting.
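To illustrate why flow matching samples more cheaply than score-based diffusion, the following is a minimal, hypothetical sketch (not the paper's implementation): Euler integration of a learned velocity field from a prior sample toward the data distribution. For the optimal-transport (rectified) conditional path, the velocity between a prior sample `x0` and a target `x1` is constant, so very few integration steps suffice. All names here (`sample_flow`, `velocity_field`) are illustrative assumptions.

```python
import numpy as np

def sample_flow(velocity_field, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with Euler steps.

    Flow-matching sampling transports a sample from a simple prior
    along the learned velocity field toward the target distribution,
    typically with far fewer steps than score-based diffusion needs.
    """
    x = np.asarray(x0, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_field(x, t)  # one Euler step along the flow
    return x

# Toy example: for the optimal-transport path between x0 and x1 the
# conditional velocity is the constant x1 - x0, so even a single Euler
# step lands exactly on the target.
x0 = np.zeros(3)
x1 = np.array([0.5, -1.0, 2.0])
v = lambda x, t: x1 - x0
print(sample_flow(v, x0, num_steps=1))
```

In RFM-Pose the velocity field would instead be a trained network over the 6D pose manifold, and each Euler step becomes one action of the sampling policy that PPO fine-tunes.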