Point cloud-based 3D object tracking is an important task in autonomous driving. Although Siamese-based 3D tracking has advanced considerably in recent years, effectively learning the correlation between the template and search branches from sparse LiDAR point clouds remains challenging. Instead of correlating the two branches at a single point in the network, we present a multi-correlation Siamese Transformer network that has multiple stages and performs feature correlation at the end of each stage on sparse pillars. Specifically, in each stage, self-attention is first applied to each branch separately to capture non-local context. Cross-attention is then used to inject template information into the search area. This strategy makes feature learning in the search area aware of the template while keeping the individual characteristics of the template intact. To help the network preserve the information learned at different stages and to ease optimization, for the search area we densely connect the initial input sparse pillars and the output of each stage to all subsequent stages and to the target localization network, which converts pillars to bird's-eye-view (BEV) feature maps and predicts the state of the target with a small densely connected convolutional network. Deep supervision is also added to each stage to further boost performance. The proposed algorithm is evaluated on the popular KITTI, nuScenes, and Waymo datasets, and the experimental results show that our method achieves promising performance compared with the state of the art. An ablation study showing the effectiveness of each component is also provided. Code is available at https://github.com/liangp/MCSTN-3DSOT.
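The stage structure described above can be illustrated with a minimal NumPy sketch. It is not the released implementation: the function names (`attention`, `stage`, `multi_stage`), the single-head attention without learned query/key/value projections, the random fusion matrix `w_fuse`, and the feature sizes are all assumptions made for illustration, and deep supervision and the BEV localization head are omitted. It only shows the data flow: per-branch self-attention, cross-attention from search to template, and dense connections on the search branch.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(query, context):
    # Single-head scaled dot-product attention with a residual connection.
    # Assumption: no learned projections, to keep the sketch short.
    d = query.shape[-1]
    weights = softmax(query @ context.T / np.sqrt(d))
    return query + weights @ context


def stage(template, dense_search, w_fuse):
    # Fuse the densely connected search inputs back to the base width
    # (hypothetical linear projection).
    search = dense_search @ w_fuse
    # Self-attention on each branch separately (non-local context).
    template = attention(template, template)
    search = attention(search, search)
    # Cross-attention: search queries attend to the template, so template
    # information flows into the search area while the template itself
    # is left untouched by the search branch.
    search = attention(search, template)
    return template, search


def multi_stage(template, search, num_stages=3, seed=0):
    rng = np.random.default_rng(seed)
    d = search.shape[-1]
    history = [search]  # initial pillar features plus every stage output
    for _ in range(num_stages):
        dense_in = np.concatenate(history, axis=-1)
        fan_in = dense_in.shape[-1]
        w_fuse = rng.standard_normal((fan_in, d)) / np.sqrt(fan_in)
        template, out = stage(template, dense_in, w_fuse)
        history.append(out)
    # A localization head would receive all search features, densely connected.
    return template, np.concatenate(history, axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    template = rng.standard_normal((8, 16))   # 8 template pillars, 16-dim
    search = rng.standard_normal((32, 16))    # 32 search pillars, 16-dim
    t_out, feats = multi_stage(template, search, num_stages=3)
    print(t_out.shape, feats.shape)
```

In this sketch the feature tensor handed to the localization head grows by one base width per stage, which is the usual cost of dense connectivity; a practical network would typically use learned projections and sparse pillar indexing rather than dense arrays.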