In this paper, we address the problem of 6-DoF object pose estimation from a single RGB image. Indirect methods, which typically predict intermediate 2D keypoints and then recover the pose with a Perspective-n-Point (PnP) solver, have shown strong performance. Direct approaches, which regress the pose in an end-to-end manner, are usually more computationally efficient but less accurate. However, direct regression heads rely on globally pooled features and thus ignore spatial second-order statistics, even though these statistics are informative for pose prediction. In most cases they also predict discontinuous pose representations, which lack robustness. We therefore propose a covariance-pooled representation that encodes the distribution of convolutional features as a symmetric positive definite (SPD) matrix. We further propose a novel pose encoding that represents the pose as an SPD matrix through its Cholesky decomposition. The pose is then regressed end-to-end by a manifold-aware network head that takes the Riemannian geometry of SPD matrices into account. Experiments and ablations consistently demonstrate the relevance of second-order pooling and continuous representations for direct pose regression, including under partial occlusion.
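To make the two SPD constructions mentioned above concrete, the following is a minimal NumPy sketch under stated assumptions, not the implementation used in the paper. It shows (i) standard covariance pooling of a feature map into an SPD matrix, (ii) the log-Euclidean map, which is one common way to respect SPD geometry and is assumed here purely for illustration, and (iii) a generic Cholesky parametrisation that maps unconstrained reals to an SPD matrix and back. All function names and the `eps` regulariser are hypothetical. The abstract does not specify how a 6-DoF pose is mapped to the entries of the SPD matrix, so that step is deliberately left out.

```python
import numpy as np


def covariance_pool(features, eps=1e-5):
    """Second-order pooling: (C, H, W) feature map -> (C, C) SPD matrix."""
    c = features.shape[0]
    x = features.reshape(c, -1)            # C x N, one column per spatial location
    x = x - x.mean(axis=1, keepdims=True)  # centre each channel
    cov = x @ x.T / (x.shape[1] - 1)       # sample covariance across locations
    return cov + eps * np.eye(c)           # assumed ridge term: keeps the matrix strictly SPD


def spd_log(spd):
    """Matrix logarithm of an SPD matrix via eigendecomposition (log-Euclidean map)."""
    w, v = np.linalg.eigh(spd)
    return (v * np.log(w)) @ v.T           # V diag(log w) V^T


def cholesky_to_spd(params, n):
    """Map n(n+1)/2 unconstrained reals to an n x n SPD matrix L @ L.T."""
    L = np.zeros((n, n))
    L[np.tril_indices(n)] = params
    d = np.diag_indices(n)
    L[d] = np.exp(L[d])                    # positive diagonal makes the factor unique
    return L @ L.T


def spd_to_cholesky(spd):
    """Inverse of cholesky_to_spd: SPD matrix -> unconstrained parameter vector."""
    L = np.linalg.cholesky(spd).copy()     # lower-triangular factor with positive diagonal
    d = np.diag_indices(L.shape[0])
    L[d] = np.log(L[d])
    return L[np.tril_indices(L.shape[0])]
```

Because the Cholesky factor with a positive diagonal is unique, `spd_to_cholesky(cholesky_to_spd(p, n))` recovers `p` up to floating-point error, which is the property that makes this kind of parametrisation a continuous, one-to-one regression target.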