Semantic segmentation and stereo matching are two essential components of 3D environmental perception systems for autonomous driving. Nevertheless, conventional approaches often address these two problems independently, employing separate models for each task. Such separation poses practical limitations, particularly when computational resources are scarce or real-time performance is imperative. Hence, in this article, we introduce S$^3$M-Net, a novel joint learning framework developed to perform semantic segmentation and stereo matching simultaneously. Specifically, S$^3$M-Net shares the features extracted from RGB images between both tasks, resulting in an improved overall scene understanding capability. This feature sharing process is realized using a feature fusion adaption (FFA) module, which effectively transforms the shared features into semantic space and subsequently fuses them with the encoded disparity features. The entire joint learning framework is trained by minimizing a novel semantic consistency-guided (SCG) loss, which emphasizes structural consistency across both tasks. Extensive experiments conducted on the vKITTI2 and KITTI datasets demonstrate the effectiveness of our proposed joint learning framework and its superior performance compared to other state-of-the-art single-task networks. Our project webpage is accessible at mias.group/S3M-Net.
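To make the feature-sharing idea concrete, below is a minimal PyTorch sketch of a shared RGB encoder feeding both task branches, with an FFA-style block that adapts the shared features into semantic space and fuses them with encoded disparity features. All module names, channel sizes, and the internal fusion design (a 1x1 adaption convolution followed by concatenation and a 3x3 convolution) are illustrative assumptions; the abstract does not specify the actual FFA architecture.

```python
# Hedged sketch of the feature-sharing + FFA-style fusion described above.
# Every layer choice here is a hypothetical stand-in, not the paper's design.
import torch
import torch.nn as nn


class FFABlock(nn.Module):
    """Hypothetical fusion block: adapts shared RGB features toward the
    semantic space, then fuses them with encoded disparity features."""

    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolution as a stand-in for the semantic-space adaption.
        self.adapt = nn.Conv2d(channels, channels, kernel_size=1)
        # Fuse adapted semantic features with disparity features.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, shared_feat, disp_feat):
        semantic_feat = self.adapt(shared_feat)
        return self.fuse(torch.cat([semantic_feat, disp_feat], dim=1))


class SharedEncoderToyModel(nn.Module):
    """Toy joint model: one RGB encoder serves both task branches."""

    def __init__(self, channels: int = 32, num_classes: int = 19):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.disp_encoder = nn.Conv2d(channels, channels, 3, padding=1)
        self.ffa = FFABlock(channels)
        self.seg_head = nn.Conv2d(channels, num_classes, kernel_size=1)
        self.disp_head = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, left_img):
        shared = self.encoder(left_img)        # features shared by both tasks
        disp_feat = self.disp_encoder(shared)  # disparity branch features
        fused = self.ffa(shared, disp_feat)    # FFA-style fusion
        return self.seg_head(fused), self.disp_head(disp_feat)


if __name__ == "__main__":
    model = SharedEncoderToyModel()
    seg_logits, disparity = model(torch.randn(1, 3, 64, 128))
    print(seg_logits.shape, disparity.shape)
```

In this sketch, the segmentation head consumes the fused features while the disparity head consumes the disparity-branch features, reflecting the abstract's description of shared features being adapted and fused on the semantic side; the real S$^3$M-Net and its SCG loss are detailed in the paper itself.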