Monocular depth inference is a fundamental problem in robotic scene perception. A given robot may be equipped with a camera plus an optional depth sensor of any type and may operate in scenes of widely differing scales, yet recent advances have split the problem into multiple separate sub-tasks. As a result, models must be fine-tuned for each specific robot, which makes customization costly in large-scale industrialization. This paper investigates a unified task of monocular depth inference: inferring high-quality depth maps from all kinds of raw input data, from various robots, in unseen scenes. A basic benchmark, G2-MonoDepth, is developed for this task. It comprises four components: (a) a unified data representation, RGB+X, that accommodates RGB plus raw depth with diverse scene scales/semantics, depth sparsity ([0%, 100%]), and errors (holes/noise/blur); (b) a novel unified loss that adapts to the diverse depth sparsity/errors of the raw input data and the diverse scales of the output scenes; (c) an improved network that propagates diverse scene scales from input to output; and (d) a data augmentation pipeline that simulates all types of real artifacts in raw depth maps for training. G2-MonoDepth is applied to three sub-tasks in unseen scenes (depth estimation, depth completion at different sparsity levels, and depth enhancement), and it consistently outperforms SOTA baselines on both real-world and synthetic data.
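To make the idea of a single input format covering sparsity from 0% to 100% more concrete, the following is a minimal NumPy sketch, not the paper's implementation. The function names `simulate_raw_depth` and `make_rgbx`, the multiplicative noise model, the square hole, and the choice to append a validity mask as an extra channel are all illustrative assumptions; the abstract does not specify these details.

```python
import numpy as np


def simulate_raw_depth(depth, keep_ratio, noise_std=0.01, hole_size=0, rng=None):
    """Degrade a dense depth map into a hypothetical raw-depth input.

    keep_ratio = 0.0 leaves no valid depth (pure depth estimation);
    keep_ratio = 1.0 keeps every pixel (depth enhancement);
    values in between mimic sparse sensors (depth completion).
    """
    rng = np.random.default_rng() if rng is None else rng
    raw = depth.astype(np.float32).copy()

    # Assumed noise model: multiplicative Gaussian noise on each depth value.
    raw *= 1.0 + noise_std * rng.standard_normal(raw.shape).astype(np.float32)

    # Random sparsity: keep each pixel independently with probability keep_ratio.
    mask = rng.random(raw.shape) < keep_ratio

    # Optional square hole, standing in for a region the sensor failed to measure.
    if hole_size > 0:
        h, w = raw.shape
        top = rng.integers(0, max(h - hole_size, 0) + 1)
        left = rng.integers(0, max(w - hole_size, 0) + 1)
        mask[top:top + hole_size, left:left + hole_size] = False

    # Invalid pixels are encoded as zero depth.
    raw = np.where(mask, raw, 0.0).astype(np.float32)
    return raw, mask


def make_rgbx(rgb, raw_depth, mask):
    """Stack RGB (H, W, 3), raw depth (H, W), and mask (H, W) into (H, W, 5)."""
    return np.concatenate(
        [
            rgb.astype(np.float32),
            raw_depth[..., None].astype(np.float32),
            mask[..., None].astype(np.float32),
        ],
        axis=-1,
    )
```

Under these assumptions, the three sub-tasks differ only in how the depth channel is populated: calling `simulate_raw_depth` with `keep_ratio=0.0` yields an empty depth channel, so the model sees RGB alone, while `keep_ratio=1.0` yields a dense but noisy map to be enhanced.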