Single Image Depth Prediction Made Better: A Multivariate Gaussian Take

Neural-network-based single image depth prediction (SIDP) is a challenging task in which the goal is to predict a scene's per-pixel depth at test time. Since the problem is, by definition, ill-posed, the fundamental goal is to devise an approach that can reliably model scene depth from a set of training examples. In pursuit of perfect depth estimation, most existing state-of-the-art learning techniques predict a single scalar depth value per pixel. Yet it is well known that a trained model has accuracy limits and can predict imprecise depth. An SIDP approach must therefore account for the expected depth variations in the model's prediction at test time. Accordingly, we introduce an approach that performs continuous modeling of per-pixel depth, so that we can predict and reason about both the per-pixel depth and its distribution. To this end, we model per-pixel scene depth using a multivariate Gaussian distribution. Moreover, in contrast to existing uncertainty modeling methods in the same spirit, which assume that per-pixel depths are independent, we introduce per-pixel covariance modeling that encodes each pixel's depth dependency with respect to all other scene points. Unfortunately, per-pixel depth covariance modeling leads to a computationally expensive continuous loss function, which we solve efficiently using a learned low-rank approximation of the overall covariance matrix. Notably, when tested on benchmark datasets such as KITTI, NYU, and SUN-RGB-D, the SIDP model obtained by optimizing our loss function shows state-of-the-art results. Our method (named MG) ranks among the top entries on the KITTI depth-prediction benchmark leaderboard.
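To illustrate why a low-rank covariance makes a multivariate Gaussian loss tractable, the following is a minimal NumPy sketch, not the paper's implementation. It assumes the covariance takes the form Sigma = F F^T + s * I, with F an N x M factor (M much smaller than N) and s a positive scalar; this parameterization, the function name low_rank_gaussian_nll, and all argument names are assumptions made for illustration. Under that assumption, the Woodbury identity and the matrix determinant lemma let the negative log-likelihood be evaluated with only an M x M solve, never forming the N x N covariance.

```python
import numpy as np


def low_rank_gaussian_nll(depth, mean, factor, noise_var):
    """Negative log-likelihood of N(mean, factor @ factor.T + noise_var * I).

    Hypothetical helper for illustration only.
    depth, mean : (N,) flattened per-pixel depth and predicted mean
    factor      : (N, M) low-rank covariance factor, M << N (assumed form)
    noise_var   : positive scalar added on the diagonal (assumed form)
    """
    n, m = factor.shape
    residual = depth - mean

    # Small M x M matrix: I + F^T F / noise_var
    cap = np.eye(m) + factor.T @ factor / noise_var

    # Matrix determinant lemma:
    # log|F F^T + s I| = log|I + F^T F / s| + N log(s)
    _, logdet_cap = np.linalg.slogdet(cap)
    log_det = logdet_cap + n * np.log(noise_var)

    # Woodbury identity:
    # r^T Sigma^{-1} r = (r^T r - p^T cap^{-1} p / s) / s, with p = F^T r
    proj = factor.T @ residual
    maha = (residual @ residual
            - proj @ np.linalg.solve(cap, proj) / noise_var) / noise_var

    return 0.5 * (maha + log_det + n * np.log(2.0 * np.pi))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_pixels, rank = 1000, 8  # illustrative sizes, not from the paper
    mu = rng.normal(size=n_pixels)
    f = 0.1 * rng.normal(size=(n_pixels, rank))
    d = mu + rng.normal(size=n_pixels)
    print(low_rank_gaussian_nll(d, mu, f, noise_var=0.5))
```

In a training setting, the mean and the factor would presumably be network outputs and the same expression would be written in an autodiff framework. The point of the sketch is only that the dominant cost scales with N * M^2 rather than N^3.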
