Serialization-based methods, which serialize 3D voxels and group them into multiple sequences before feeding them to Transformers, have demonstrated their effectiveness in 3D object detection. However, serializing 3D voxels into 1D sequences inevitably sacrifices voxel spatial proximity. This issue is hard to address by enlarging the group size in existing serialization-based methods, because the complexity of Transformers grows quadratically with feature size. Inspired by recent advances in state space models (SSMs), we present a Voxel SSM, termed Voxel Mamba, which employs a group-free strategy to serialize the whole space of voxels into a single sequence. The linear complexity of SSMs makes this group-free design practical and alleviates the loss of spatial proximity among voxels. To further enhance spatial proximity, we propose a Dual-scale SSM Block that establishes a hierarchical structure, enabling a larger receptive field along the 1D serialization curve as well as more complete local regions in 3D space. Moreover, we implicitly apply window partitioning under the group-free framework through positional encoding, which further enhances spatial proximity by encoding voxel positional information. Our experiments on the Waymo Open Dataset and the nuScenes dataset show that Voxel Mamba not only achieves higher accuracy than state-of-the-art methods, but also demonstrates significant advantages in computational efficiency.
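To make the difference between grouped and group-free serialization concrete, the following is a minimal sketch and not the authors' implementation. It assumes a Z-order (Morton) space-filling curve, which the abstract does not specify, and the names `morton_key`, `serialize`, and `split_into_groups` are hypothetical helpers introduced only for illustration.

```python
from typing import List, Tuple

Voxel = Tuple[int, int, int]


def morton_key(x: int, y: int, z: int, bits: int = 10) -> int:
    """Interleave the low `bits` bits of x, y, z into one Z-order key.

    Assumption: a Z-order curve stands in for whatever serialization
    curve the actual method uses.
    """
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (3 * i)
        key |= ((y >> i) & 1) << (3 * i + 1)
        key |= ((z >> i) & 1) << (3 * i + 2)
    return key


def serialize(voxels: List[Voxel]) -> List[Voxel]:
    """Group-free serialization: order all voxels into a single 1D sequence."""
    return sorted(voxels, key=lambda v: morton_key(*v))


def split_into_groups(sequence: List[Voxel], group_size: int) -> List[List[Voxel]]:
    """Grouped variant: cut the 1D sequence into fixed-size chunks.

    Voxels that fall on either side of a chunk boundary can no longer
    interact, even if they are neighbours in 3D space.
    """
    return [sequence[i:i + group_size] for i in range(0, len(sequence), group_size)]


# Hypothetical non-empty voxel coordinates.
voxels = [(3, 1, 0), (0, 0, 0), (1, 0, 0), (0, 1, 0), (2, 2, 1)]

single_sequence = serialize(voxels)              # one sequence over the whole space
groups = split_into_groups(single_sequence, 2)   # grouped baseline for comparison
```

In this toy example the voxels `(1, 0, 0)` and `(0, 1, 0)` are adjacent on the curve but land in different groups once the sequence is cut into chunks of two, whereas the single sequence keeps them next to each other. A sequence model whose cost is linear in sequence length can consume `single_sequence` directly, which is the motivation the abstract gives for dropping the grouping step.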