Most approaches to semantic segmentation use only information from color cameras to parse scenes, yet recent advances show that depth data can further improve performance. In this work, we focus on transformer-based deep learning architectures, which have achieved state-of-the-art performance on the segmentation task, and we propose to exploit depth information by embedding it in the positional encoding. In effect, we extend the network to multimodal data without adding any parameters, in a natural way that leverages the strength of the transformers' self-attention modules. We also investigate performing cross-modality operations inside the attention module, swapping the key inputs between the depth and color branches. Our approach consistently improves performance on the Cityscapes benchmark.
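To make the two ideas concrete, the following is a minimal sketch and not the implementation evaluated in this work. It assumes a fixed sinusoidal code computed from a per-token depth value, which is one parameter-free way to place depth in the positional encoding, and a single attention head with identity query, key, and value projections. All function names (`sinusoidal_encoding`, `add_depth_encoding`, `attention`, `cross_modal_attention`) and the toy inputs are hypothetical and chosen only for illustration.

```python
import math


def sinusoidal_encoding(value, dim, base=10000.0):
    """Fixed (parameter-free) sinusoidal code for a scalar, here a depth value.

    Assumption: the standard sin/cos positional scheme is reused with depth
    in place of the token index.
    """
    enc = []
    for i in range(0, dim, 2):
        freq = 1.0 / (base ** (i / dim))
        enc.append(math.sin(value * freq))
        enc.append(math.cos(value * freq))
    return enc[:dim]  # truncate in case dim is odd


def add_depth_encoding(tokens, depths):
    """Add a depth-derived code to each token embedding (no learned weights)."""
    dim = len(tokens[0])
    return [
        [t + e for t, e in zip(tok, sinusoidal_encoding(d, dim))]
        for tok, d in zip(tokens, depths)
    ]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def cross_modal_attention(color, depth):
    """Swap the key inputs between branches.

    Assumption: identity projections, so each branch's tokens serve directly
    as its queries and values, while the keys come from the other branch.
    """
    color_out = attention(color, depth, color)
    depth_out = attention(depth, color, depth)
    return color_out, depth_out


# Toy usage: two tokens with 4-dimensional embeddings and one depth per token.
color_tokens = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
depth_tokens = [[0.0, 0.1, 0.0, 0.1], [0.3, 0.2, 0.1, 0.0]]
depth_per_token = [2.0, 10.0]

encoded_color = add_depth_encoding(color_tokens, depth_per_token)
color_out, depth_out = cross_modal_attention(encoded_color, depth_tokens)
```

Because the depth code is computed rather than learned, the parameter count of the network is unchanged in this sketch. A real model would apply learned query, key, and value projections in each branch and use multiple heads; those are omitted here for brevity.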