VoxelTrack: Exploring Voxel Representation for 3D Point Cloud Object Tracking

Current LiDAR point cloud-based 3D single object tracking (SOT) methods typically rely on point-based representation networks. Despite their demonstrated success, such networks suffer from two fundamental problems: 1) They contain pooling operations to cope with inherently disordered point clouds, which hinders the capture of 3D spatial information that is useful for tracking, a regression task. 2) The adopted set abstraction operation struggles to handle density-inconsistent point clouds, which also prevents 3D spatial information from being modeled. To solve these problems, we introduce a novel tracking framework, termed VoxelTrack. By voxelizing inherently disordered point clouds into 3D voxels and extracting their features via sparse convolution blocks, VoxelTrack effectively models precise and robust 3D spatial information, thereby guiding accurate position prediction for tracked objects. Moreover, VoxelTrack incorporates a dual-stream encoder with a cross-iterative feature fusion module to further explore fine-grained 3D spatial information for tracking. Because it models accurate 3D spatial information, VoxelTrack simplifies the tracking pipeline to a single regression loss. Extensive experiments are conducted on three widely adopted datasets: KITTI, NuScenes and Waymo Open Dataset. The experimental results confirm that VoxelTrack achieves state-of-the-art performance (88.3%, 71.4% and 63.6% mean precision on the three datasets, respectively) and outperforms existing trackers while running in real time at 36 FPS on a single TITAN RTX GPU. The source code and model will be released.
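To make the voxelization step concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a fixed voxel size and averages the points that fall into each voxel; the function name `voxelize`, the voxel size of 0.5, and the mean-pooling choice are all illustrative assumptions that do not come from the abstract. In a pipeline of the kind described, a sparse convolution backbone would then operate only on the occupied voxels.

```python
import math
from collections import defaultdict


def voxelize(points, voxel_size, origin=(0.0, 0.0, 0.0)):
    """Group 3D points into voxels and average the points in each voxel.

    points: iterable of (x, y, z) tuples.
    voxel_size: (dx, dy, dz) edge lengths (hypothetical values, not from the paper).
    Returns a dict mapping an integer voxel index (i, j, k) to the mean point.
    """
    buckets = defaultdict(list)
    for p in points:
        # Quantize each coordinate to an integer grid index.
        idx = tuple(
            math.floor((c - o) / s) for c, o, s in zip(p, origin, voxel_size)
        )
        buckets[idx].append(p)

    voxels = {}
    for idx, members in buckets.items():
        n = len(members)
        # Mean of the member points is used as the voxel feature (an assumption).
        voxels[idx] = tuple(sum(m[d] for m in members) / n for d in range(3))
    return voxels


# Three points: the first two share a voxel, the third falls in another one.
points = [(0.1, 0.1, 0.1), (0.3, 0.1, 0.1), (1.2, 0.0, -0.4)]
voxels = voxelize(points, voxel_size=(0.5, 0.5, 0.5))
```

Only occupied voxels appear in the returned dictionary, so the representation stays sparse however large the scene is. The regular grid also gives every feature a fixed spatial location, which is the property the abstract contrasts with pooling over unordered points.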
