Transformer-based trackers have achieved strong accuracy on the standard benchmarks. However, their efficiency remains an obstacle to practical deployment on both GPU and CPU platforms. To overcome this issue, we propose a fully transformer tracking framework, termed \emph{MixFormerV2}, that uses neither dense convolutional operations nor a complex score prediction module. Our key design is to introduce four special prediction tokens and concatenate them with the tokens from the target template and the search area. We then apply a unified transformer backbone to this mixed token sequence. Through mixed attention, the prediction tokens capture the complex correlation between the target template and the search area. From these tokens, simple MLP heads predict the tracking box and estimate its confidence score. To further improve the efficiency of MixFormerV2, we present a new distillation-based model reduction paradigm comprising dense-to-sparse distillation and deep-to-shallow distillation. The former transfers knowledge from the dense-head-based MixViT to our fully transformer tracker, while the latter prunes some layers of the backbone. We instantiate two variants of MixFormerV2: MixFormerV2-B achieves an AUC of 70.6\% on LaSOT and an AUC of 57.4\% on TNL2k at a GPU speed of 165 FPS, and MixFormerV2-S surpasses FEAR-L by 2.7\% AUC on LaSOT while running in real time on a CPU.
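The token layout described above can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the actual MixFormerV2 implementation: attention is single-head with identity projections, each MLP head is reduced to one linear layer, and all tokens and weights are random placeholders. The mapping of one prediction token to one box coordinate, and the use of the mean prediction token for the confidence score, are assumptions made for illustration. All function and variable names are hypothetical.

```python
import math
import random

NUM_PRED_TOKENS = 4  # the abstract specifies four prediction tokens


def random_tokens(count, dim, rng, scale=1.0):
    """Random stand-ins for learned or embedded tokens (illustration only)."""
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(count)]


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head attention with identity Q/K/V projections (a simplification)."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for query in tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        attended.append(
            [sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)]
        )
    return attended


def linear_head(token, weights):
    """One linear layer standing in for an MLP head; maps a token to a scalar."""
    return sum(w * x for w, x in zip(weights, token))


def track_step(pred_tokens, template, search, box_heads, score_head, num_layers=2):
    # Concatenate prediction, template, and search tokens into one sequence.
    mixed = pred_tokens + template + search
    for _ in range(num_layers):
        attended = self_attention(mixed)
        # Residual connection; layer norm and the feed-forward block are omitted.
        mixed = [[x + a for x, a in zip(tok, att)] for tok, att in zip(mixed, attended)]
    # Only the updated prediction tokens are passed to the heads.
    updated = mixed[:NUM_PRED_TOKENS]
    # Assumption: one prediction token per box coordinate.
    box = [linear_head(tok, head) for tok, head in zip(updated, box_heads)]
    # Assumption: confidence is computed from the mean of the prediction tokens.
    dim = len(updated[0])
    mean_token = [sum(tok[d] for tok in updated) / NUM_PRED_TOKENS for d in range(dim)]
    score = 1.0 / (1.0 + math.exp(-linear_head(mean_token, score_head)))
    return box, score


rng = random.Random(0)
dim = 8
pred_tokens = random_tokens(NUM_PRED_TOKENS, dim, rng)
template = random_tokens(4, dim, rng)   # hypothetical 2x2 grid of template patches
search = random_tokens(16, dim, rng)    # hypothetical 4x4 grid of search patches
box_heads = random_tokens(NUM_PRED_TOKENS, dim, rng, scale=0.1)
score_head = random_tokens(1, dim, rng, scale=0.1)[0]

box, score = track_step(pred_tokens, template, search, box_heads, score_head)
print(box, score)
```

Because the heads read only the four prediction tokens, the cost of the prediction stage in this sketch does not grow with the size of the search area, which is in the spirit of removing the dense head.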