Transformers deployed for visual object tracking have achieved state-of-the-art results on several benchmarks. However, transformer-based models remain under-utilized in lightweight Siamese tracking because of the computational complexity of their attention blocks. This paper proposes an efficient self- and mixed-attention transformer-based architecture for lightweight tracking. The proposed backbone uses separable mixed-attention transformers to fuse the template and search regions during feature extraction, producing a superior feature encoding. Our prediction head performs global contextual modeling of the encoded features by leveraging efficient self-attention blocks for robust target state estimation. With these contributions, the proposed lightweight tracker deploys a transformer-based backbone and head module concurrently for the first time. Our ablation study confirms the effectiveness of the proposed combination of backbone and head modules. Experiments show that our Separable Self and Mixed Attention-based Tracker, SMAT, surpasses related lightweight trackers on the GOT10k, TrackingNet, LaSOT, NfS30, UAV123, and AVisT datasets, while running at 37 fps on CPU and 158 fps on GPU with 3.8M parameters. For example, it significantly surpasses the closely related trackers E.T.Track and MixFormerV2-S on GOT10k-test by margins of 7.9% and 5.8%, respectively, in the AO metric. The tracker code and model are available at https://github.com/goutamyg/SMAT
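For readers unfamiliar with the AO metric cited above, the following is a minimal sketch of average overlap as the mean intersection-over-union (IoU) between predicted and ground-truth boxes over the frames of one sequence. The function names `iou` and `average_overlap` and the `(x, y, width, height)` box format are assumptions made for illustration. This is not the SMAT code or the official GOT10k evaluation toolkit, whose protocol details (such as aggregation across sequences) are not reproduced here.

```python
from typing import Sequence

# Assumed box format for this sketch: (x, y, width, height).
Box = Sequence[float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Width and height of the overlapping rectangle, clamped at zero.
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    # Degenerate boxes with zero area yield zero overlap.
    return inter / union if union > 0 else 0.0


def average_overlap(preds: Sequence[Box], gts: Sequence[Box]) -> float:
    """Mean per-frame IoU over one sequence (hypothetical helper)."""
    if len(preds) != len(gts) or len(preds) == 0:
        raise ValueError("preds and gts must be non-empty and equal in length")
    return sum(iou(p, g) for p, g in zip(preds, gts)) / len(preds)
```

Under this definition, a tracker that matches the ground truth exactly in half of the frames and misses the target entirely in the other half would score an AO of 0.5.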