We propose a novel method for jointly estimating the shape and pose of rigid objects from their sequentially observed RGB-D images. In sharp contrast to past approaches that rely on complex non-linear optimization, we formulate the estimation as a neural optimization that learns to estimate the shape and pose efficiently. We introduce the Deep Directional Distance Function (DeepDDF), a neural network that directly outputs the depth image of an object given the camera viewpoint and viewing direction, so that errors can be computed efficiently in 2D image space. We formulate the joint estimation itself as a Transformer, which we refer to as TransPoser. We fully leverage tokenization to sequentially process the growing set of observations, and multi-head attention to efficiently update the shape and pose with a learned momentum. Experimental results on synthetic and real data show that DeepDDF achieves high accuracy as a category-level object shape representation and that TransPoser efficiently achieves state-of-the-art accuracy for joint shape and pose estimation.
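To make the idea of a directional distance function concrete, the following is a minimal sketch and not the authors' implementation. It assumes an analytic sphere in place of the learned DeepDDF network, a pinhole camera at the origin looking along +z, and hypothetical names (`sphere_ddf`, `render_depth`, `focal`) chosen only for illustration. It shows how a function that maps a ray origin and viewing direction to a surface distance can be queried once per pixel to produce a depth image directly.

```python
import numpy as np

def sphere_ddf(origins, directions, center, radius):
    """Analytic directional distance to a sphere (stand-in for a learned DDF).

    origins, directions: arrays of shape (..., 3); directions must be unit length.
    Returns the distance along each ray to the first hit, or np.inf on a miss.
    """
    oc = origins - center
    b = np.sum(oc * directions, axis=-1)
    c = np.sum(oc * oc, axis=-1) - radius ** 2
    disc = b * b - c                              # discriminant of the ray/sphere quadratic
    t = -b - np.sqrt(np.maximum(disc, 0.0))       # nearer of the two intersections
    hit = (disc >= 0.0) & (t > 0.0)
    return np.where(hit, t, np.inf)

def render_depth(ddf, size=5, focal=2.0):
    """Query the distance function once per pixel to form a depth image.

    Assumes a pinhole camera at the origin looking along +z (an illustrative choice).
    """
    half = (size - 1) / 2.0
    u, v = np.meshgrid(np.arange(size) - half, np.arange(size) - half)
    dirs = np.stack([u / focal, v / focal, np.ones_like(u)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    origins = np.zeros_like(dirs)
    dist = ddf(origins, dirs)                     # distance along each viewing ray
    return dist * dirs[..., 2]                    # convert ray distance to z-depth

# Unit sphere placed 3 units in front of the camera.
center = np.array([0.0, 0.0, 3.0])
depth = render_depth(lambda o, d: sphere_ddf(o, d, center, 1.0))
```

In this sketch the depth image comes from a single batched evaluation with no ray marching, which is the property that makes comparing a predicted depth image against an observed one in 2D image space inexpensive. A learned model would replace `sphere_ddf` with a network evaluation.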