Semantic image and video segmentation are among the most important tasks in computer vision, since they provide a complete and meaningful representation of the environment through a dense, per-pixel classification of a given scene. Recently, Deep Learning, and more precisely Convolutional Neural Networks, has raised semantic segmentation to a new level of performance and generalization capability. However, designing Deep Semantic Segmentation models is a complex task, as it may involve application-dependent aspects. In autonomous driving applications in particular, the robustness-efficiency trade-off should be taken into consideration, along with intrinsic limitations (computational and memory bounds, data scarcity) and constraints (real-time inference). In this respect, two promising directions have not yet been explored to their full potential in the literature: the use of additional data modalities, such as depth perception for reasoning about the geometry of a scene, and the use of temporal cues from videos to explore redundancy and consistency. In this paper, we survey the most relevant and recent advances in Deep Semantic Segmentation in the context of vision for autonomous vehicles, from three perspectives: efficiency-oriented model development for real-time operation, RGB-Depth data integration (RGB-D semantic segmentation), and the use of temporal information from videos in temporally-aware models. Our main objective is to provide a comprehensive discussion of the main methods, advantages, limitations, results, and challenges from each perspective, so that the reader can not only get started but also stay up to date with recent advances in this exciting and challenging research field.