3D LiDAR-based single object tracking (SOT) has gained increasing attention because of its crucial role in 3D applications such as autonomous driving. The central problem is how to learn a target-aware representation from sparse and incomplete point clouds. In this paper, we propose a novel Correlation Pyramid Network (CorpNet) with a unified encoder and a motion-factorized decoder. Specifically, the main branch of the encoder introduces multi-level self-attention and cross-attention, which respectively enrich the template and search region features and realize their fusion and interaction. Additionally, considering the sparsity of point clouds, we design a lateral correlation pyramid structure for the encoder that keeps as many points as possible by integrating hierarchical correlated features. The search region features output by the encoder can be fed directly into the decoder to predict target locations without any extra matcher. Moreover, in the decoder of CorpNet, we design a motion-factorized head to explicitly learn the different movement patterns of the up axis and the x-y plane together. Extensive experiments on two commonly used datasets show that CorpNet achieves state-of-the-art results while running in real time.
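To make the fusion step concrete, the following is a minimal sketch of single-head cross-attention in which each search region point attends to the template points. It is not the CorpNet implementation: the function names `softmax` and `cross_attention` are hypothetical, the learned query/key/value projections are replaced by identity maps, and the residual connection is an assumption made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(search, template):
    """Fuse template information into search region features.

    search:   list of N feature vectors (one per search region point)
    template: list of M feature vectors (one per template point)
    Both use the same feature dimension. Identity projections are an
    assumption of this sketch; a real model would learn them.
    """
    dim = len(search[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in search:
        # Scaled dot-product similarity between this search point and
        # every template point.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in template]
        weights = softmax(scores)
        # Attention-weighted sum of template features.
        attended = [
            sum(w * v[d] for w, v in zip(weights, template))
            for d in range(dim)
        ]
        # Residual connection (assumed) keeps the original search feature.
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused


if __name__ == "__main__":
    search_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    template_feats = [[0.5, -0.5], [0.2, 0.8]]
    for row in cross_attention(search_feats, template_feats):
        print(row)
```

Because the output has one fused vector per search region point, it could in principle be passed to a prediction head without a separate matching stage, which is the property the abstract attributes to the unified encoder.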