3D single object tracking (SOT) in point clouds remains challenging because of appearance variation, distractors, and the high sparsity of point clouds. Notably, in autonomous driving scenarios, the target object typically maintains spatial adjacency across consecutive frames and moves predominantly in the horizontal plane. This spatial continuity offers valuable prior knowledge for target localization. However, existing trackers, which often employ point-wise representations, struggle to exploit this knowledge efficiently owing to the irregular format of such representations. Consequently, they require elaborate designs and must solve multiple subtasks to establish spatial correspondence. In this paper, we introduce BEVTrack, a simple yet strong baseline framework for 3D SOT. After converting consecutive point clouds into a common Bird's-Eye-View (BEV) representation, BEVTrack inherently encodes spatial proximity and captures motion cues for tracking via a simple element-wise operation and convolutional layers. Additionally, to better handle objects of diverse sizes and motion patterns, BEVTrack directly learns the underlying motion distribution rather than making a fixed Laplacian or Gaussian assumption as in previous works. Without bells and whistles, BEVTrack achieves state-of-the-art performance on the KITTI and nuScenes datasets while maintaining a high inference speed of 122 FPS. The code will be released at https://github.com/xmm-prio/BEVTrack.
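To make the BEV idea concrete, the following is a minimal sketch, not the BEVTrack implementation. It assumes a plain per-cell point count as the BEV feature, a hypothetical 6.4 m square search region with 0.2 m cells, and a hand-written centroid difference in place of the learned convolutional layers. The function names `points_to_bev`, `fuse_frames`, and `occupied_centroid` are illustrative only. The sketch shows why a regular grid helps: once two consecutive frames share the same grid, an element-wise operation aligns them cell by cell, and horizontal motion appears as a shift in grid indices.

```python
import numpy as np


def points_to_bev(points, x_range=(-3.2, 3.2), y_range=(-3.2, 3.2), cell=0.2):
    """Rasterize an (N, 3) point cloud into a 2D BEV grid of per-cell point counts.

    The ranges and cell size are assumed values chosen for illustration.
    """
    nx = int(round((x_range[1] - x_range[0]) / cell))
    ny = int(round((y_range[1] - y_range[0]) / cell))
    ix = np.floor((points[:, 0] - x_range[0]) / cell).astype(int)
    iy = np.floor((points[:, 1] - y_range[0]) / cell).astype(int)
    # Discard points that fall outside the grid; height (z) is ignored here.
    keep = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    bev = np.zeros((nx, ny), dtype=np.float32)
    # np.add.at accumulates correctly when several points share one cell.
    np.add.at(bev, (ix[keep], iy[keep]), 1.0)
    return bev


def fuse_frames(bev_prev, bev_curr):
    """Element-wise fusion of two aligned BEV grids (here: a simple difference)."""
    return bev_curr - bev_prev


def occupied_centroid(bev):
    """Mean grid index of all occupied cells, as a crude stand-in for a learned head."""
    return np.argwhere(bev > 0).mean(axis=0)


# Toy target: three points that move 0.4 m along x between two frames.
prev_points = np.array([[0.05, 0.05, 0.0], [0.25, 0.05, 0.1], [0.05, 0.25, 0.2]])
curr_points = prev_points + np.array([0.4, 0.0, 0.0])

bev_prev = points_to_bev(prev_points)
bev_curr = points_to_bev(curr_points)
motion_map = fuse_frames(bev_prev, bev_curr)

# Convert the index shift back to meters using the assumed 0.2 m cell size.
estimated_shift = (occupied_centroid(bev_curr) - occupied_centroid(bev_prev)) * 0.2
```

In this toy case the fused map is negative where the target was and positive where it now is, so the displacement can be read off the grid. A learned tracker would replace the count feature and the centroid difference with trained feature extractors and regression layers.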