3D single object tracking (SOT) in point clouds remains challenging because of appearance variation, distractors, and the high sparsity of point clouds. Notably, in autonomous driving scenarios, the target object typically stays spatially adjacent across consecutive frames and moves predominantly in the horizontal plane. This spatial continuity offers valuable prior knowledge for target localization. However, existing trackers, which often employ point-wise representations, struggle to exploit this knowledge efficiently because such representations are irregular in format. Consequently, they require elaborate designs and must solve multiple subtasks to establish spatial correspondence. In this paper, we introduce BEVTrack, a simple yet strong baseline framework for 3D SOT. After converting consecutive point clouds into a common Bird's-Eye-View (BEV) representation, BEVTrack inherently encodes spatial proximity and captures motion cues for tracking via a simple element-wise operation and convolutional layers. Additionally, to better handle objects with diverse sizes and motion patterns, BEVTrack directly learns the underlying motion distribution rather than making a fixed Laplacian or Gaussian assumption as in previous works. Without bells and whistles, BEVTrack achieves state-of-the-art performance on the KITTI and nuScenes datasets while maintaining a high inference speed of 122 FPS. The code will be released at https://github.com/xmm-prio/BEVTrack.
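To make the BEV idea concrete, the following is a minimal illustrative sketch, not the BEVTrack implementation. It rasterizes two consecutive point clouds onto a shared top-down grid and combines them with an element-wise operation. The function names `points_to_bev` and `fuse_frames`, the grid extent, the cell size, and the choice of subtraction as the element-wise operation are all assumptions made for illustration; the learned convolutional layers that would consume the fused grid are omitted.

```python
import numpy as np


def points_to_bev(points, x_range=(-3.2, 3.2), y_range=(-3.2, 3.2), cell=0.2):
    """Rasterize an (N, 3) point cloud into a 2D grid of per-cell point counts.

    The grid extent and cell size are illustrative defaults, not values
    taken from the paper.
    """
    nx = int(round((x_range[1] - x_range[0]) / cell))
    ny = int(round((y_range[1] - y_range[0]) / cell))
    # Binning on x and y only collapses the height (z) axis, which yields
    # a regular top-down grid from an irregular point set.
    bev, _, _ = np.histogram2d(
        points[:, 0], points[:, 1], bins=(nx, ny), range=(x_range, y_range)
    )
    return bev.astype(np.float32)


def fuse_frames(bev_prev, bev_curr):
    """Stack both grids with their element-wise difference as a motion cue.

    Subtraction is an assumed stand-in for the element-wise operation;
    a convolutional network would normally process the stacked result.
    """
    return np.stack([bev_prev, bev_curr, bev_curr - bev_prev], axis=0)


# Toy example: three points that shift 0.4 m along x between frames.
prev_points = np.array([[0.1, 0.1, 0.0], [0.1, 0.1, 0.5], [0.3, 0.1, 0.2]])
curr_points = prev_points + np.array([0.4, 0.0, 0.0])

fused = fuse_frames(points_to_bev(prev_points), points_to_bev(curr_points))
print(fused.shape)
```

Because both frames share the same grid, a cell index refers to the same physical location in each frame, so the difference channel is negative where the object was and positive where it now is. This is the sense in which a regular grid encodes spatial proximity without an explicit correspondence search.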