3D single object tracking (SOT) in point clouds remains challenging due to appearance variation, distractors, and the high sparsity of point clouds. Notably, in autonomous driving scenarios, the target object typically maintains spatial adjacency across consecutive frames and moves predominantly horizontally. This spatial continuity offers valuable prior knowledge for target localization. However, existing trackers, which often employ point-wise representations, struggle to exploit this knowledge efficiently owing to the irregular format of such representations. Consequently, they require elaborate designs and must solve multiple subtasks to establish spatial correspondence. In this paper, we introduce BEVTrack, a simple yet strong baseline framework for 3D SOT. After converting consecutive point clouds into the common Bird's-Eye View (BEV) representation, BEVTrack inherently encodes spatial proximity and captures motion cues for tracking via a simple element-wise operation and convolutional layers. Additionally, to better handle objects with diverse sizes and motion patterns, BEVTrack directly learns the underlying motion distribution rather than assuming a fixed Laplacian or Gaussian distribution as in previous works. Without bells and whistles, BEVTrack achieves state-of-the-art performance on the KITTI and NuScenes datasets while maintaining a high inference speed of 122 FPS. The code will be released at https://github.com/xmm-prio/BEVTrack.
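To make the BEV idea concrete, the following is a minimal illustrative sketch and not BEVTrack's actual implementation. It rasterizes two consecutive point clouds onto the same BEV grid and combines them with an element-wise difference, so that horizontal motion shows up as a shift between neighboring cells. The function names `points_to_bev` and `fuse_bev`, the grid range, the cell size, and the choice of subtraction as the element-wise operation are all assumptions made for illustration; the abstract does not specify them, and the learned feature extractor and convolutional layers are omitted.

```python
import numpy as np

def points_to_bev(points, x_range=(-4.0, 4.0), y_range=(-4.0, 4.0), cell=0.5):
    """Rasterize an (N, 3) point cloud into a 2D BEV point-count grid.

    Hypothetical helper: range and cell size are illustrative values only.
    """
    nx = int(round((x_range[1] - x_range[0]) / cell))
    ny = int(round((y_range[1] - y_range[0]) / cell))
    grid = np.zeros((ny, nx), dtype=np.float32)
    if len(points) == 0:
        return grid
    # Map metric x/y coordinates to integer cell indices; height (z) is dropped.
    ix = np.floor((points[:, 0] - x_range[0]) / cell).astype(int)
    iy = np.floor((points[:, 1] - y_range[0]) / cell).astype(int)
    # Discard points that fall outside the grid.
    keep = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    # np.add.at accumulates correctly when several points share one cell.
    np.add.at(grid, (iy[keep], ix[keep]), 1.0)
    return grid

def fuse_bev(prev_bev, curr_bev):
    """Assumed element-wise fusion: difference of two cell-aligned grids."""
    return curr_bev - prev_bev

# Toy example: a single point moves 0.5 m along x between frames.
prev_points = np.array([[0.1, 0.1, 0.0]])
curr_points = np.array([[0.6, 0.1, 0.0]])
motion_cue = fuse_bev(points_to_bev(prev_points), points_to_bev(curr_points))
```

Because both frames share one regular grid, the same cell index refers to the same physical location in each frame. That is why a plain element-wise operation is enough to expose displacement here, whereas irregular point-wise representations would first need explicit correspondence between points.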