3D Single Object Tracking (SOT) is a fundamental task in computer vision and is essential for applications such as autonomous driving. Localizing the target amid its surroundings remains challenging due to appearance variations, distractors, and the high sparsity of point clouds. To address these issues, prior Siamese and motion-centric trackers both require elaborate designs and the solving of multiple subtasks. In this paper, we propose BEVTrack, a simple yet effective baseline method. By estimating the target motion in Bird's-Eye View (BEV) to perform tracking, BEVTrack demonstrates surprising simplicity in various aspects, i.e., network design, training objectives, and tracking pipeline, while achieving superior performance. Furthermore, to achieve accurate regression for targets with diverse attributes (e.g., sizes and motion patterns), BEVTrack constructs the likelihood function with learned underlying distributions adapted to different targets, rather than making a fixed Laplacian or Gaussian assumption as in previous works. This provides valuable priors for tracking and thus further boosts performance. Using only a single regression loss with a plain convolutional architecture, BEVTrack achieves state-of-the-art performance on three large-scale datasets, KITTI, NuScenes, and the Waymo Open Dataset, while maintaining a high inference speed of about 200 FPS. The code will be released at https://github.com/xmm-prio/BEVTrack.
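To make the contrast between a fixed-distribution loss and a learned one concrete, the following is a minimal sketch, not the paper's actual implementation: a Gaussian negative log-likelihood where the network additionally predicts a per-target log-scale, so the likelihood adapts to targets of different sizes and motion patterns. The function name and the choice of a Gaussian (rather than the paper's learned distribution family) are illustrative assumptions.

```python
import numpy as np

def gaussian_nll(pred, target, log_sigma):
    """Negative log-likelihood of a Gaussian with a learned scale.

    With log_sigma fixed at 0 this reduces to a standard L2-style loss
    (a fixed Gaussian assumption); letting the network predict log_sigma
    per target adapts the likelihood, down-weighting residuals for
    targets whose motion is inherently harder to regress.
    """
    sigma = np.exp(log_sigma)
    return 0.5 * ((pred - target) / sigma) ** 2 + log_sigma + 0.5 * np.log(2.0 * np.pi)

# Fixed assumption: unit scale, loss grows quadratically with the error.
loss_fixed = gaussian_nll(3.0, 0.0, 0.0)
# Adaptive: a larger predicted scale tempers the penalty on the same error,
# at the cost of the log_sigma regularization term.
loss_adaptive = gaussian_nll(3.0, 0.0, 1.0)
```

The `log_sigma` term in the loss prevents the trivial solution of predicting an arbitrarily large scale, so the network is only rewarded for inflating the scale on genuinely uncertain targets.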