Monocular depth estimation is fundamental for 3D scene understanding and downstream applications. However, even in the supervised setting, it remains challenging and ill-posed because a single image lacks full geometric constraints. Although a scene can consist of millions of pixels, it contains far fewer high-level patterns. We propose iDisc to learn those patterns with internal discretized representations. The method implicitly partitions the scene into a set of high-level patterns. In particular, our new module, Internal Discretization (ID), implements a continuous-discrete-continuous bottleneck to learn those concepts without supervision. In contrast to state-of-the-art methods, the proposed model does not enforce any explicit constraints or priors on the depth output. Because the bottleneck module is based on attention, the whole network with the ID module can be trained end-to-end. Our method sets a new state of the art with significant improvements on NYU-Depth v2 and KITTI, outperforming all published methods on the official KITTI benchmark. iDisc also achieves state-of-the-art results on surface normal estimation. Further, we explore the model's generalization capability via zero-shot testing. We observe a compelling need to promote diversification in the outdoor scenario. Hence, we introduce splits of two autonomous driving datasets, DDAD and Argoverse. Code is available at http://vis.xyz/pub/idisc.
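The continuous-discrete-continuous bottleneck described in the abstract can be illustrated with a small, dependency-free sketch. It is not the released iDisc implementation. It assumes a single attention head, no learned projections, and randomly initialized queries standing in for learned ones. The function names `partition`, `broadcast`, and `id_bottleneck` are hypothetical. The choice to normalize the first attention step over the K representations, so that every pixel splits its weight among them, is also an assumption made for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def partition(pixels, queries):
    """Continuous -> discrete: summarize N pixel features into K representations."""
    scale = math.sqrt(len(queries[0]))
    # assign[i][k] is the soft assignment of pixel i to representation k;
    # each row sums to 1 over k (assumed normalization, see lead-in).
    assign = [softmax([dot(p, q) / scale for q in queries]) for p in pixels]
    reps = []
    for k in range(len(queries)):
        mass = sum(a[k] for a in assign)
        # Weighted mean of the pixel features softly assigned to representation k.
        reps.append([
            sum(a[k] * p[j] for a, p in zip(assign, pixels)) / mass
            for j in range(len(pixels[0]))
        ])
    return reps, assign


def broadcast(pixels, reps):
    """Discrete -> continuous: each pixel reads back from the K representations."""
    scale = math.sqrt(len(reps[0]))
    out = []
    for p in pixels:
        weights = softmax([dot(p, r) / scale for r in reps])
        out.append([
            sum(w * r[j] for w, r in zip(weights, reps)) for j in range(len(p))
        ])
    return out


def id_bottleneck(pixels, queries):
    """Run both steps; every output feature is a mixture of only K vectors."""
    reps, assign = partition(pixels, queries)
    return broadcast(pixels, reps), reps, assign


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy sizes (assumptions): 12 "pixels", 4 channels, 3 representations.
    pixels = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(12)]
    queries = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]
    out, reps, assign = id_bottleneck(pixels, queries)
    print(len(out), len(out[0]), len(reps))
```

Every operation in the sketch is a softmax or a weighted sum, so the same computation written in an autodiff framework is differentiable throughout. This is consistent with the abstract's point that the attention-based bottleneck allows end-to-end training.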