Uni6D is the first 6D pose estimation approach to employ a unified backbone network to extract features from both RGB and depth images. We find that the principal causes of Uni6D's performance limitations are Instance-Outside and Instance-Inside noise. Uni6D's simple pipeline design inherently introduces Instance-Outside noise from background pixels in the receptive field, while ignoring Instance-Inside noise in the input depth data. In this paper, we propose a two-step denoising approach to handle both kinds of noise in Uni6D. In the first step, an instance segmentation network crops and masks the instance to reduce noise from non-instance regions. In the second step, a lightweight depth denoising module calibrates the depth feature before it is fed into the pose regression network. Extensive experiments show that our Uni6Dv2 reliably and robustly eliminates noise, outperforming Uni6D without sacrificing too much inference efficiency. It also reduces the need for annotated real data, which is costly to label.
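The first denoising step (cropping and masking the instance) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `crop_and_mask`, the array layouts, and the choice to zero out background pixels are assumptions made for illustration, and the boolean mask is assumed to come from some instance segmentation network that is not shown here.

```python
import numpy as np

def crop_and_mask(rgb, depth, mask):
    """Crop an RGB-D pair to the instance's bounding box and zero the background.

    Hypothetical helper for illustration only.
    rgb:   (H, W, 3) array, depth: (H, W) array, mask: (H, W) boolean-like array
    in which True marks pixels belonging to the instance.
    """
    mask = np.asarray(mask).astype(bool)
    if not mask.any():
        raise ValueError("mask contains no instance pixels")

    # Tight bounding box around the instance pixels.
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    r0, r1 = int(rows[0]), int(rows[-1]) + 1
    c0, c1 = int(cols[0]), int(cols[-1]) + 1

    # Crop, then suppress the non-instance pixels that remain inside the box.
    m = mask[r0:r1, c0:c1]
    rgb_crop = rgb[r0:r1, c0:c1] * m[..., None]
    depth_crop = depth[r0:r1, c0:c1] * m
    return rgb_crop, depth_crop, (r0, c0, r1, c1)
```

Cropping removes most background pixels from the receptive field, and masking suppresses the ones left inside the bounding box. The depth denoising module of the second step is a learned component and is not sketched here.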