Humans perceive and construct the world as an arrangement of simple parametric models. In particular, we can often describe man-made environments using volumetric primitives such as cuboids or cylinders. Inferring these primitives is important for attaining high-level, abstract scene descriptions. Previous approaches for primitive-based abstraction estimate shape parameters directly and are only able to reproduce simple objects. In contrast, we propose a robust estimator for primitive fitting, which meaningfully abstracts complex real-world environments using cuboids. A RANSAC estimator guided by a neural network fits these primitives to a depth map. We condition the network on previously detected parts of the scene, parsing the scene one part at a time. To obtain cuboids from single RGB images, we additionally optimise a depth estimation CNN end-to-end. Naively minimising point-to-primitive distances leads to large or spurious cuboids that occlude parts of the scene. We thus propose an improved occlusion-aware distance metric that correctly handles opaque scenes. Furthermore, we present a neural-network-based cuboid solver which provides more parsimonious scene abstractions while also reducing inference time. The proposed algorithm does not require labour-intensive labels, such as cuboid annotations, for training. Results on the NYU Depth v2 dataset demonstrate that the proposed algorithm successfully abstracts cluttered real-world 3D scene layouts.
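To make the naive point-to-primitive distance mentioned in the abstract concrete, the following is a minimal sketch of the unsigned distance from a 3D point to the surface of a cuboid. It is an illustration under stated assumptions, not the authors' implementation: the cuboid is assumed to be axis-aligned and parametrised by a centre and half-sizes, and the function names `point_to_cuboid_distance` and `mean_distance` are hypothetical. The occlusion-aware metric is not reproduced here because the abstract does not define it.

```python
import math


def point_to_cuboid_distance(point, centre, half_sizes):
    """Unsigned distance from a 3D point to the surface of an axis-aligned cuboid.

    Assumption: the cuboid is axis-aligned. An oriented cuboid would first
    need the point transformed into the cuboid's local coordinate frame.
    """
    # Per-axis offset from the cuboid faces: positive outside, negative inside.
    d = [abs(p - c) - h for p, c, h in zip(point, centre, half_sizes)]
    # Euclidean distance to the cuboid for points lying outside it.
    outside = math.sqrt(sum(max(x, 0.0) ** 2 for x in d))
    # Negative depth below the nearest face for points lying inside it.
    inside = min(max(d), 0.0)
    return abs(outside + inside)


def mean_distance(points, centre, half_sizes):
    """Naive fitting cost: average point-to-surface distance over all points."""
    total = sum(point_to_cuboid_distance(p, centre, half_sizes) for p in points)
    return total / len(points)
```

A cost of this kind only measures how close each point is to some face of the cuboid. It does not check whether the cuboid would hide other observed points from the camera, which is the failure mode the abstract attributes to naive minimisation.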