Serialization-based methods, which serialize 3D voxels and group them into multiple sequences before feeding them to Transformers, have proven effective in 3D object detection. However, serializing 3D voxels into 1D sequences inevitably sacrifices the spatial proximity of voxels. Existing serialization-based methods cannot easily address this issue by enlarging the group size, because the complexity of Transformers grows quadratically with the number of voxel features in each group. Inspired by recent advances in state space models (SSMs), we present a voxel SSM, termed Voxel Mamba, which employs a group-free strategy to serialize the whole space of voxels into a single sequence. The linear complexity of SSMs makes this group-free design practical and alleviates the loss of spatial proximity among voxels. To further enhance spatial proximity, we propose a Dual-scale SSM Block that establishes a hierarchical structure, enabling a larger receptive field along the 1D serialization curve as well as more complete local regions in 3D space. Moreover, we implicitly apply window partitioning under the group-free framework through positional encoding, which further enhances spatial proximity by encoding voxel positional information. Our experiments on the Waymo Open Dataset and the nuScenes dataset show that Voxel Mamba not only achieves higher accuracy than state-of-the-art methods, but also demonstrates significant advantages in computational efficiency.
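To make the contrast between grouped and group-free serialization concrete, the following is a minimal sketch, not the authors' implementation. It assumes a Z-order (Morton) curve purely for brevity, since the abstract does not state which space-filling curve is used; the names `morton_key`, `serialize`, and `split_into_groups` are hypothetical.

```python
from typing import List, Sequence, Tuple

Voxel = Tuple[int, int, int]  # integer (x, y, z) voxel coordinates


def morton_key(x: int, y: int, z: int, bits: int = 10) -> int:
    """Interleave the low `bits` bits of x, y, z into one Z-order key.

    Assumption: a Morton curve stands in for whatever curve the method uses.
    """
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (3 * i)
        key |= ((y >> i) & 1) << (3 * i + 1)
        key |= ((z >> i) & 1) << (3 * i + 2)
    return key


def serialize(voxels: Sequence[Voxel]) -> List[int]:
    """Group-free serialization: one ordering of ALL voxels along the curve.

    Returns voxel indices sorted by curve position.
    """
    return sorted(range(len(voxels)), key=lambda i: morton_key(*voxels[i]))


def split_into_groups(order: Sequence[int], group_size: int) -> List[List[int]]:
    """Grouped variant: cut the single sequence into fixed-size chunks.

    Voxels on opposite sides of a chunk boundary no longer share a sequence,
    even when they are adjacent in 3D space.
    """
    return [list(order[i:i + group_size]) for i in range(0, len(order), group_size)]


if __name__ == "__main__":
    voxels = [(1, 1, 1), (0, 0, 0), (1, 0, 0), (0, 1, 0), (2, 0, 0)]
    order = serialize(voxels)
    print("single sequence:", [voxels[i] for i in order])
    print("groups of 2:", split_into_groups(order, 2))
```

In the grouped variant, a sequence model only sees the voxels inside one chunk, so enlarging `group_size` is the only way to recover context, and that is the step that becomes costly with quadratic attention. With a linear-time sequence model, the whole `order` list can be processed as one sequence.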