Accurate 3D object detection in large-scale outdoor scenes, where object scales vary considerably, requires features rich in both long-range and fine-grained information. Recent detectors use window-based transformers to model long-range dependencies, but they tend to overlook fine-grained details. To bridge this gap, we propose MsSVT++, a Mixed-scale Sparse Voxel Transformer that captures both types of information simultaneously through a divide-and-conquer approach: attention heads are explicitly divided into multiple groups, each responsible for attending to information within a specific range, and the outputs of these groups are merged to obtain the final mixed-scale features. To reduce the computational cost of applying a window-based transformer in 3D voxel space, we introduce a novel Chessboard Sampling strategy and implement voxel sampling and gathering operations sparsely using a hash map. A further challenge is that non-empty voxels lie primarily on object surfaces, which impedes accurate bounding-box estimation. To overcome this, we introduce a Center Voting module that integrates newly voted voxels, enriched with mixed-scale contextual information, toward object centers, thereby improving object localization. Extensive experiments demonstrate that our single-stage detector built on MsSVT++ consistently delivers strong performance across diverse datasets.
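To make the sparse gathering idea concrete, the following is a minimal illustrative sketch, not the implementation used in MsSVT++. It stores non-empty voxel coordinates in a hash map (a Python `dict`) and gathers the voxels inside a small and a large window, standing in for the ranges attended to by two head groups. The helper names (`build_voxel_hash`, `chessboard_keep`, `gather_window`) and the parity-based sampling rule are assumptions made for illustration; the abstract does not specify the actual Chessboard Sampling pattern.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int, int]


def build_voxel_hash(coords: List[Coord]) -> Dict[Coord, int]:
    """Map each non-empty voxel coordinate to its row index in a feature table."""
    return {c: i for i, c in enumerate(coords)}


def chessboard_keep(coord: Coord, phase: int) -> bool:
    """Assumed sampling rule: keep voxels whose (x + y) parity equals `phase`."""
    x, y, _ = coord
    return (x + y) % 2 == phase


def gather_window(center: Coord, radius: int,
                  table: Dict[Coord, int], phase: int) -> List[int]:
    """Return indices of non-empty, chessboard-selected voxels in a cubic window."""
    cx, cy, cz = center
    found = []
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            for dz in range(-radius, radius + 1):
                c = (cx + dx, cy + dy, cz + dz)
                idx = table.get(c)  # empty voxels are simply absent from the map
                if idx is not None and chessboard_keep(c, phase):
                    found.append(idx)
    return found


# Toy scene with five non-empty voxels (hypothetical data).
coords = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0), (3, 3, 0)]
table = build_voxel_hash(coords)

# Two window sizes stand in for two attention-head groups.
small = gather_window((0, 0, 0), 1, table, phase=0)
large = gather_window((0, 0, 0), 3, table, phase=0)
print(small, large)
```

This sketch loops over every cell in the window for clarity. A practical implementation would batch the lookups on the GPU, but the key property is the same: empty voxels cost only a failed hash lookup and never enter the attention computation.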