3D single object tracking with point clouds is a critical task in 3D computer vision. Previous methods usually take the last two frames as input and use the predicted box to obtain the template point cloud in the previous frame and the search-area point cloud in the current frame, then apply similarity-based or motion-based methods to predict the current box. Although these methods achieve good tracking performance, they ignore the historical information of the target, which is important for tracking. In this paper, instead of two frames, we input multiple frames of point clouds to encode the spatio-temporal information of the target and implicitly learn its motion, which builds correlations among frames so that the target can be tracked efficiently in the current frame. Meanwhile, rather than fusing point features directly, we first crop the point cloud features into many patches, then use a sparse attention mechanism to encode patch-level similarity, and finally fuse the multi-frame features. Extensive experiments show that our method achieves competitive results on challenging large-scale benchmarks (62.6% on KITTI and 49.66% on nuScenes).
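
The patch-level fusion step can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names (`patchify`, `topk_sparse_attention`, `fuse_frames`), the mean pooling of consecutive points into patches, the top-k form of sparsity, and the residual fusion are all assumptions chosen only to make the idea concrete. It also assumes the points in each frame are ordered so that consecutive indices are spatial neighbours.

```python
import numpy as np


def patchify(feats, patch_size):
    """Pool (N, C) point features into (N // patch_size, C) patch features.

    Assumption: consecutive rows are spatially close, so a patch is simply
    a run of `patch_size` consecutive points, summarised by its mean.
    """
    n, c = feats.shape
    n_patches = n // patch_size
    trimmed = feats[: n_patches * patch_size]  # drop the incomplete tail
    return trimmed.reshape(n_patches, patch_size, c).mean(axis=1)


def topk_sparse_attention(query, keys, values, k):
    """Scaled dot-product attention where each query keeps only its
    k highest-scoring keys (one hypothetical form of sparse attention)."""
    scores = query @ keys.T / np.sqrt(query.shape[1])  # (Q, K) similarities
    k = min(k, keys.shape[0])
    # Column indices of the k largest scores in every row.
    top_idx = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    mask = np.full_like(scores, -np.inf)
    np.put_along_axis(mask, top_idx, 0.0, axis=1)
    masked = scores + mask  # non-selected entries become -inf
    masked = masked - masked.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(masked)  # exp(-inf) == 0, so the rows are sparse
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ values, weights


def fuse_frames(frames, patch_size=4, k=2):
    """Fuse historical frames into the current one at patch level.

    `frames` is a list of (N, C) feature arrays, oldest first; the last
    entry is treated as the current frame.
    """
    patches = [patchify(f, patch_size) for f in frames]
    current = patches[-1]
    history = np.concatenate(patches[:-1], axis=0)
    attended, weights = topk_sparse_attention(current, history, history, k)
    return current + attended, weights  # residual fusion (an assumption)
```

With three frames of 16 points and 8 channels, `fuse_frames(frames, patch_size=4, k=2)` returns four fused patch features and a 4 x 8 weight matrix in which every row has exactly two non-zero entries, so each current-frame patch draws on only its two most similar historical patches.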