Vision-based Semantic Scene Completion (SSC) has attracted much attention due to its wide applications in various 3D perception tasks. Existing sparse-to-dense approaches typically employ shared, context-independent queries across different input images; because the focal regions of different inputs vary, such queries fail to capture image-specific distinctions and can lead to undirected feature aggregation in cross-attention. Additionally, the absence of depth information can cause distinct 3D points to project onto the same 2D position on the image plane, or onto nearby sampling locations in the feature map, resulting in depth ambiguity. In this paper, we present a novel context and geometry aware voxel transformer. It utilizes a context aware query generator to initialize context-dependent queries tailored to individual input images, effectively capturing their unique characteristics and aggregating information within regions of interest. Furthermore, it extends deformable cross-attention from 2D to 3D pixel space, enabling points with similar image coordinates to be distinguished by their depth coordinates. Building upon this module, we introduce a neural network named CGFormer to achieve semantic scene completion. CGFormer also leverages multiple 3D representations (i.e., voxel and TPV) to boost the semantic and geometric representation capacity of the transformed 3D volume from both local and global perspectives. Experimental results demonstrate that CGFormer achieves state-of-the-art performance on the SemanticKITTI and SSCBench-KITTI-360 benchmarks, attaining mIoU scores of 16.87 and 20.05, and IoU scores of 45.99 and 48.07, respectively. Remarkably, CGFormer even outperforms approaches that use temporal images as inputs or much larger image backbone networks.
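To make the core mechanism concrete, the sketch below illustrates deformable cross-attention lifted from 2D to 3D pixel space: each voxel query predicts sampling offsets in (u, v, depth) coordinates, so two points that share a 2D projection are still separated by their depth coordinate. This is a minimal PyTorch illustration under assumed shapes and names; the class name `DeformableCrossAttention3D`, the offset scale of 0.05, and the depth-binned feature volume `feat3d` are our illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableCrossAttention3D(nn.Module):
    """Minimal sketch: deformable cross-attention in 3D (u, v, d) pixel space.

    Each voxel query predicts a small set of 3D sampling offsets and
    per-point attention weights, then gathers features from a depth-aware
    3D feature volume via trilinear sampling.
    """
    def __init__(self, dim=128, num_points=4):
        super().__init__()
        self.num_points = num_points
        self.offset_proj = nn.Linear(dim, num_points * 3)  # (du, dv, dd) per point
        self.weight_proj = nn.Linear(dim, num_points)      # one weight per point
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, feat3d):
        # queries:    (B, Nq, C)        voxel queries
        # ref_points: (B, Nq, 3)        normalized (u, v, d) references in [-1, 1]
        # feat3d:     (B, C, D, H, W)   image features lifted along depth bins
        B, Nq, C = queries.shape
        offsets = self.offset_proj(queries).view(B, Nq, self.num_points, 3)
        weights = self.weight_proj(queries).softmax(dim=-1)           # (B, Nq, P)
        # Sampling locations in normalized 3D pixel space; the 0.05 scale
        # bounding the offsets is an assumed hyperparameter.
        locs = ref_points.unsqueeze(2) + 0.05 * offsets.tanh()        # (B, Nq, P, 3)
        grid = locs.view(B, Nq, self.num_points, 1, 3)                # 5D grid for grid_sample
        sampled = F.grid_sample(feat3d, grid, align_corners=False)    # (B, C, Nq, P, 1)
        sampled = sampled.squeeze(-1).permute(0, 2, 3, 1)             # (B, Nq, P, C)
        sampled = self.value_proj(sampled)
        out = (weights.unsqueeze(-1) * sampled).sum(dim=2)            # (B, Nq, C)
        return self.out_proj(out)

# Example usage with arbitrary shapes (D depth bins over an H x W feature map):
attn = DeformableCrossAttention3D(dim=128, num_points=4)
q = torch.randn(2, 1000, 128)
ref = torch.rand(2, 1000, 3) * 2 - 1
feats = torch.randn(2, 128, 16, 24, 80)
out = attn(q, ref, feats)  # (2, 1000, 128)
```

Because the sampling grid carries a third (depth) coordinate, queries whose reference points coincide in (u, v) but differ in d attend to different slices of the lifted volume, which is precisely the ambiguity the 2D formulation cannot resolve.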