Multi-modal depth estimation is one of the key challenges in endowing autonomous machines with robust perception capabilities. Uni-modal depth estimation has advanced considerably, using either monocular cameras, valued for their high resolution, or LiDAR sensors, valued for the precise geometric data they provide. However, each sensor has inherent drawbacks: cameras are highly sensitive to changes in illumination, and LiDARs have limited resolution. Sensor fusion can combine the merits of the two kinds of sensors and compensate for their downsides. Nevertheless, current fusion methods work at a high level: they process the sensor data streams independently and combine the high-level estimates obtained for each sensor. In this paper, we tackle the problem at a low level by fusing the raw sensor streams, obtaining depth estimates that are both dense and precise and that can serve as a unified multi-modal data source for higher-level estimation problems. We propose a Conditional Random Field model with multiple geometry and appearance potentials that represents the problem of estimating dense depth maps from camera and LiDAR data. The model can be optimized efficiently using the Conjugate Gradient Squared algorithm. The proposed method is evaluated and compared with state-of-the-art methods on the widely used KITTI benchmark dataset.
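To make the idea of a quadratic Conditional Random Field solved with Conjugate Gradient Squared more concrete, the following is a minimal sketch, not the model proposed in this paper. It assumes a much simpler energy with only two potentials: a data term that ties the estimate to the sparse LiDAR depth where it is available, and a smoothness term between neighbouring pixels whose weight decays with their intensity difference. The function name `densify_depth` and the parameters `data_weight` and `sigma` are hypothetical and chosen for illustration only. Setting the gradient of this energy to zero gives a sparse linear system, which is passed to `scipy.sparse.linalg.cgs`.

```python
import numpy as np
from scipy.sparse import coo_matrix, diags
from scipy.sparse.linalg import cgs


def densify_depth(image, sparse_depth, mask, data_weight=10.0, sigma=0.1):
    """Hypothetical helper: minimise a simplified quadratic CRF energy.

    E(d) = data_weight * sum_{i observed} (d_i - z_i)^2
         + sum_{(i, j) neighbours} w_ij * (d_i - d_j)^2,
    with w_ij = exp(-(I_i - I_j)^2 / sigma^2)  (assumed appearance weight).
    """
    h, w = image.shape
    n = h * w
    idx = np.arange(n).reshape(h, w)
    flat = image.ravel().astype(float)

    rows, cols, vals = [], [], []
    # Horizontal and vertical neighbour pairs on the pixel grid.
    pairs = ((idx[:, :-1], idx[:, 1:]), (idx[:-1, :], idx[1:, :]))
    for a, b in pairs:
        a, b = a.ravel(), b.ravel()
        wgt = np.exp(-((flat[a] - flat[b]) ** 2) / sigma ** 2)
        # Each pair adds +w on two diagonal entries and -w off the diagonal.
        rows += [a, b, a, b]
        cols += [a, b, b, a]
        vals += [wgt, wgt, -wgt, -wgt]

    # Duplicate (row, col) entries are summed when converting to CSR,
    # which yields the weighted graph Laplacian of the grid.
    laplacian = coo_matrix(
        (np.concatenate(vals), (np.concatenate(rows), np.concatenate(cols))),
        shape=(n, n),
    ).tocsr()

    m = mask.ravel().astype(float)
    system = laplacian + diags(data_weight * m, 0)   # (L + W) d = W z
    rhs = data_weight * m * sparse_depth.ravel()

    # Conjugate Gradient Squared solve; info == 0 signals convergence.
    depth, info = cgs(system, rhs, maxiter=2000)
    return depth.reshape(h, w), info


if __name__ == "__main__":
    # Toy scene: two image regions, each with its own constant depth.
    image = np.zeros((8, 8))
    image[:, 4:] = 1.0
    truth = np.where(image > 0.5, 8.0, 2.0)
    mask = np.zeros((8, 8), dtype=bool)
    mask[::2, ::2] = True                 # one LiDAR return per 2x2 block
    dense, info = densify_depth(image, np.where(mask, truth, 0.0), mask)
    print("converged:", info == 0)
    print(np.round(dense, 2))
```

In this toy setting the intensity edge makes the smoothness weight across the two regions nearly zero, so depth is propagated within each region but not across the boundary. The system is only well posed when every connected region contains at least one observed depth value, which is an assumption of this sketch.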