A Cross-Scale Hierarchical Transformer with Correspondence-Augmented Attention for inferring Bird's-Eye-View Semantic Segmentation

Bird's-eye-view (BEV) semantic segmentation is simple to visualize and easy to handle, so it has been applied in autonomous driving to provide surrounding information to downstream tasks. Inferring BEV semantic segmentation from multi-camera-view images is a popular scheme in the community because it requires only cheap devices and supports real-time processing. Recent work implements this task by learning content and position relationships with the vision Transformer (ViT). However, the quadratic complexity of ViT confines relationship learning to the latent layer, leaving a scale gap that impedes the representation of fine-grained objects. In addition, the plain fusion of multi-view features in these methods does not conform to the information absorption intention in representing BEV features. To tackle these issues, we propose a novel cross-scale hierarchical Transformer with correspondence-augmented attention for inferring BEV semantic segmentation. Specifically, we devise a hierarchical framework to refine the BEV feature representation, in which the last feature size is only half that of the final segmentation. To limit the additional computation introduced by this hierarchical framework, we exploit a cross-scale Transformer to learn feature relationships in a reversed-aligning way, and leverage residual connections of BEV features to facilitate information transmission between scales. We further propose correspondence-augmented attention to distinguish conducive from non-conducive correspondences. It is implemented in a simple yet effective way: attention scores are amplified before the Softmax operation, so that position-view-related attention scores are highlighted and position-view-unrelated ones are suppressed. Extensive experiments demonstrate that our method achieves state-of-the-art performance in inferring BEV semantic segmentation conditioned on multi-camera-view images.
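
The effect of amplifying attention scores before the Softmax can be illustrated with a minimal sketch. The abstract does not specify the exact form of the amplification, so the scalar multiplicative `gain` below is an assumption made purely for illustration, as are the function names `softmax` and `amplified_attention`. It is not the paper's implementation. The sketch only shows that scaling scores up before normalization raises the weight of the strongest correspondence and lowers that of the weakest.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def amplified_attention(query, keys, values, gain=2.0):
    """Single-query dot-product attention with pre-Softmax amplification.

    `gain` is a hypothetical scalar factor (assumption); gain=1.0 gives
    ordinary scaled dot-product attention.
    """
    dim = len(query)
    # Scaled dot-product score between the query and every key.
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key in keys
    ]
    # Amplify the scores before normalization (assumed multiplicative form).
    amplified = [gain * s for s in scores]
    weights = softmax(amplified)
    # Weighted sum of the value vectors.
    output = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, output


# One BEV query attending to three image-feature keys (toy numbers).
query = [1.0, 0.0]
keys = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]]
values = [[1.0], [0.0], [-1.0]]

plain_weights, _ = amplified_attention(query, keys, values, gain=1.0)
sharp_weights, _ = amplified_attention(query, keys, values, gain=2.0)
```

With these toy inputs, `sharp_weights` puts more mass on the first key and less on the last key than `plain_weights` does, which is the highlight-and-suppress behavior described above.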
