GLT-T++: Global-Local Transformer for 3D Siamese Tracking with Ranking Loss

Siamese trackers built on a 3D region proposal network (RPN) have achieved remarkable success with deep Hough voting. However, using a single seed point feature as the voting cue fails to produce high-quality 3D proposals. Treating all seed points equally during voting, regardless of their significance, exacerbates this limitation. To address these challenges, we propose a novel transformer-based voting scheme that generates better proposals. Specifically, we devise a global-local transformer (GLT) module that integrates object- and patch-aware geometric priors into seed point features, yielding robust and accurate cues for learning seed point offsets. To train the GLT module, we introduce an importance prediction branch that learns the potential importance weights of seed points as a training constraint. Incorporating this voting scheme into the 3D RPN, we develop GLT-T, a novel Siamese method for 3D single object tracking on point clouds. Moreover, we observe that the highest-scored proposal in the Siamese paradigm may not be the most accurate one, which limits tracking performance. To address this, we treat binary score prediction as a ranking problem and design a target-aware ranking loss and a localization-aware ranking loss to rank proposals accurately. With these ranking losses, we further present GLT-T++, an enhanced version of GLT-T. Extensive experiments on multiple benchmarks demonstrate that GLT-T and GLT-T++ outperform state-of-the-art methods in tracking accuracy while maintaining real-time inference speed. The source code will be made available at https://github.com/haooozi/GLT-T.
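To illustrate the idea of treating proposal scoring as a ranking problem, the following is a minimal sketch of a generic pairwise hinge ranking loss. It is not the target-aware or localization-aware loss of GLT-T++, whose exact formulations the abstract does not give. The function name `pairwise_ranking_loss`, the `quality` input (for example, each proposal's IoU with the ground-truth box), and the `margin` value are assumptions made for illustration only.

```python
import numpy as np


def pairwise_ranking_loss(scores, quality, margin=0.1):
    """Generic pairwise hinge ranking loss (illustrative, hypothetical helper).

    scores:  predicted confidence per proposal, shape (N,)
    quality: localization quality per proposal (e.g. IoU), shape (N,)
    margin:  assumed minimum score gap between a better and a worse proposal
    """
    scores = np.asarray(scores, dtype=float)
    quality = np.asarray(quality, dtype=float)

    # score_diff[i, j] = scores[i] - scores[j]
    score_diff = scores[:, None] - scores[None, :]

    # better[i, j] is True when proposal i is localized better than proposal j
    better = quality[:, None] > quality[None, :]
    if not better.any():
        return 0.0

    # Penalize every ordered pair in which the better-localized proposal
    # does not outscore the worse one by at least `margin`.
    violations = np.maximum(0.0, margin - score_diff)
    return float(violations[better].mean())


# Scores that agree with the localization ordering incur no penalty,
# while reversed scores are penalized.
consistent = pairwise_ranking_loss([0.9, 0.5, 0.1], [0.8, 0.5, 0.2])
reversed_order = pairwise_ranking_loss([0.1, 0.5, 0.9], [0.8, 0.5, 0.2])
```

Under a loss of this kind, the highest-scored proposal is pushed toward being the best-localized one, which is the mismatch the abstract identifies in the Siamese paradigm.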
