This paper introduces InverseMatrixVT3D, an efficient method for transforming multi-view image features into 3D feature volumes for 3D semantic occupancy prediction. Existing methods for constructing 3D volumes often rely on depth estimation, device-specific operators, or transformer queries, which hinders the widespread adoption of 3D occupancy models. In contrast, our approach stores the static mapping relationships in two sparse projection matrices and multiplies them with the multi-view image feature maps to efficiently generate global Bird's Eye View (BEV) features and local 3D feature volumes. We introduce a sparse matrix handling technique for the projection matrices to optimise GPU memory usage. Moreover, we propose a global-local attention fusion module that integrates the global BEV features with the local 3D feature volumes to obtain the final 3D volume. We also employ a multi-scale supervision mechanism to further enhance performance. Comprehensive experiments on the nuScenes dataset demonstrate the simplicity and effectiveness of our method. The code will be made available at: https://github.com/DanielMing123/InverseMatrixVT3D
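To make the core idea concrete, the following is a minimal sketch of generating voxel features by multiplying a static sparse projection matrix with a flattened image feature map. It is not the released implementation. The helper names (`build_sparse_projection`, `sparse_matmul`), the toy `voxel_to_pixels` lookup, and the choice to average the pixels assigned to each voxel are all assumptions made for illustration. Pure Python is used so the example has no dependencies; a practical version would use a GPU sparse tensor library.

```python
def build_sparse_projection(voxel_to_pixels):
    """Build a row-normalised sparse matrix stored as {row: [(col, weight), ...]}.

    voxel_to_pixels is a hypothetical static lookup: for each voxel index, the
    flattened pixel indices it maps to. Rows with no pixels are not stored,
    which is what keeps the matrix sparse in memory.
    """
    rows = {}
    for voxel, pixels in voxel_to_pixels.items():
        if pixels:
            weight = 1.0 / len(pixels)  # assumption: average the mapped pixels
            rows[voxel] = [(pixel, weight) for pixel in pixels]
    return rows


def sparse_matmul(rows, num_rows, features):
    """Multiply a sparse (num_rows x num_pixels) matrix by dense (num_pixels x C) features."""
    channels = len(features[0])
    out = [[0.0] * channels for _ in range(num_rows)]
    for row, entries in rows.items():
        for col, weight in entries:
            for c in range(channels):
                out[row][c] += weight * features[col][c]
    return out


# Toy input: 4 flattened pixels with 2 feature channels each.
features = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]

# Toy static mapping: voxel 0 sees pixels 0 and 1, voxel 1 sees pixel 2,
# voxel 2 sees nothing (for example, it lies outside every camera frustum).
voxel_to_pixels = {0: [0, 1], 1: [2], 2: []}

projection = build_sparse_projection(voxel_to_pixels)  # computed once, reused per frame
volume = sparse_matmul(projection, 3, features)
print(volume)  # [[2.0, 1.0], [5.0, 4.0], [0.0, 0.0]]
```

Because the mapping in this sketch depends only on the fixed voxel-to-pixel lookup, the projection matrix is built once and only the multiplication is repeated for each new set of image features.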