This paper introduces InverseMatrixVT3D, an efficient method for transforming multi-view image features into 3D feature volumes for 3D semantic occupancy prediction. Existing methods for constructing 3D volumes often rely on depth estimation, device-specific operators, or transformer queries, which hinders the widespread adoption of 3D occupancy models. In contrast, our approach stores the static mapping relationships in two sparse projection matrices and generates global bird's-eye-view (BEV) features and local 3D feature volumes by multiplying the multi-view image feature maps with these matrices. We introduce a sparse matrix handling technique for the projection matrices to reduce GPU memory usage. Moreover, we propose a global-local attention fusion module that integrates the global BEV features with the local 3D feature volumes to obtain the final 3D volume. We also employ a multi-scale supervision mechanism to further enhance performance. Extensive experiments on the nuScenes and SemanticKITTI datasets show that our approach is simple and effective and achieves the top performance in detecting vulnerable road users (VRU), which is crucial for autonomous driving and road safety. The code is available at: https://github.com/DanielMing123/InverseMatrixVT3D
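The core idea, multiplying flattened image features by a precomputed sparse projection matrix, can be illustrated with a minimal sketch. This is not the released implementation: the CSR storage layout, the averaging weights, the toy mappings, and all names (`build_csr`, `csr_matmul`, `voxel_to_pixels`, `bev_to_pixels`) are assumptions chosen for illustration. In practice the mappings would be derived from the camera geometry, and the multiplication would run on a GPU sparse kernel.

```python
from typing import Dict, List, Tuple

# CSR layout: (row pointers, column indices, nonzero values)
Csr = Tuple[List[int], List[int], List[float]]


def build_csr(rows: List[Dict[int, float]]) -> Csr:
    """Pack per-row {pixel_index: weight} dicts into CSR arrays (nonzeros only)."""
    indptr: List[int] = [0]
    indices: List[int] = []
    data: List[float] = []
    for row in rows:
        for col in sorted(row):
            indices.append(col)
            data.append(row[col])
        indptr.append(len(indices))
    return indptr, indices, data


def csr_matmul(matrix: Csr, features: List[List[float]]) -> List[List[float]]:
    """Compute matrix @ features for a dense (num_pixels x channels) feature list."""
    indptr, indices, data = matrix
    channels = len(features[0])
    out: List[List[float]] = []
    for r in range(len(indptr) - 1):
        acc = [0.0] * channels
        # Only the stored nonzeros of row r are visited.
        for k in range(indptr[r], indptr[r + 1]):
            pixel = features[indices[k]]
            weight = data[k]
            for c in range(channels):
                acc[c] += weight * pixel[c]
        out.append(acc)
    return out


# Toy input: 4 pixels (all camera views flattened together), 2 channels each.
image_features = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]

# Hypothetical static mappings; each row averages the pixels it projects onto.
voxel_to_pixels = [{0: 0.5, 1: 0.5}, {2: 1.0}, {}]  # 3 voxels, the last sees no pixel
bev_to_pixels = [{0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}]  # 1 BEV cell

local_proj = build_csr(voxel_to_pixels)   # built once, reused for every frame
global_proj = build_csr(bev_to_pixels)

local_volume = csr_matmul(local_proj, image_features)   # one feature row per voxel
bev_features = csr_matmul(global_proj, image_features)  # one feature row per BEV cell
```

Because the mapping is static, the two matrices are built once and reused for every frame. Storing only the nonzero entries keeps memory proportional to the number of voxel-pixel correspondences rather than to the full voxels-by-pixels product.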