3D Single Object Tracking (SOT) is a fundamental task in computer vision, essential for applications such as autonomous driving. Localizing the target amid its surroundings remains challenging due to appearance variations, distractors, and the high sparsity of point clouds. To address these issues, prior Siamese and motion-centric trackers both rely on elaborate designs and must solve multiple subtasks. In this paper, we propose BEVTrack, a simple yet effective baseline method. By estimating target motion in Bird's-Eye View (BEV) to perform tracking, BEVTrack demonstrates surprising simplicity in several aspects, i.e., network design, training objectives, and tracking pipeline, while achieving superior performance. Moreover, to achieve accurate regression for targets with diverse attributes (e.g., sizes and motion patterns), BEVTrack constructs the likelihood function with learned underlying distributions adapted to different targets, rather than making a fixed Laplacian or Gaussian assumption as in previous works. This provides valuable priors for tracking and thus further boosts performance. Using only a single regression loss and a plain convolutional architecture, BEVTrack achieves state-of-the-art performance on three large-scale datasets (KITTI, NuScenes, and the Waymo Open Dataset) while maintaining a high inference speed of about 200 FPS. The code will be released at https://github.com/xmm-prio/BEVTrack.
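To make the contrast between a fixed distributional assumption and a learned one concrete, the following is a minimal sketch, not the paper's actual implementation: it assumes a Laplacian likelihood whose scale is either fixed (as in prior work) or predicted per target by the network (as a stand-in for BEVTrack's adaptive distribution). The function names and the choice of the Laplacian family are illustrative assumptions.

```python
import math

def fixed_laplace_nll(pred, target, b=1.0):
    # Negative log-likelihood under a *fixed* Laplacian assumption:
    # NLL = |target - pred| / b + log(2b). With b = 1 this reduces
    # to an L1 loss plus a constant, as in many prior trackers.
    return abs(target - pred) / b + math.log(2.0 * b)

def adaptive_laplace_nll(pred, target, log_b):
    # Sketch of a likelihood with a *learned* scale: the network
    # predicts log_b per target (hypothetical head), so confident
    # predictions can sharpen the distribution (small b) while
    # uncertain ones can broaden it (large b).
    b = math.exp(log_b)
    return abs(target - pred) / b + math.log(2.0 * b)

# The same residual is penalized differently depending on the
# learned scale, unlike the fixed-assumption loss.
sharp = adaptive_laplace_nll(1.0, 1.2, log_b=-2.0)  # small scale
broad = adaptive_laplace_nll(1.0, 1.2, log_b=0.0)   # unit scale
```

Minimizing this NLL jointly over the predicted offset and the predicted scale is what lets the distribution adapt to targets of different sizes and motion patterns, instead of treating every residual with the same fixed penalty.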