Transformer-based trackers have achieved strong accuracy on the standard benchmarks. However, their efficiency remains an obstacle to practical deployment on both GPU and CPU platforms. To overcome this issue, we propose a fully transformer-based tracking framework, coined \emph{MixFormerV2}, without any dense convolutional operation or complex score prediction module. Our key design is to introduce four special prediction tokens and concatenate them with the tokens from the target template and the search area. We then apply a unified transformer backbone to this mixed token sequence. Through mixed attention, the prediction tokens capture the complex correlation between the target template and the search area. Based on them, we can predict the tracking box and estimate its confidence score through simple MLP heads. To further improve the efficiency of MixFormerV2, we present a new distillation-based model reduction paradigm, comprising dense-to-sparse distillation and deep-to-shallow distillation. The former transfers knowledge from the dense-head-based MixViT to our fully transformer tracker, while the latter prunes some layers of the backbone. We instantiate two variants of MixFormerV2: MixFormerV2-B achieves an AUC of 70.6\% on LaSOT and an AUC of 57.4\% on TNL2k at a high GPU speed of 165 FPS, and MixFormerV2-S surpasses FEAR-L by 2.7\% AUC on LaSOT at a real-time CPU speed.
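The token-mixing design described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the weights are random, the backbone is reduced to single-head self-attention with a residual connection (no layer normalization, feed-forward block, or positional embedding), and every name (`track_step`, `make_params`, `DIM`, and so on) is hypothetical. Mapping each prediction token to one normalized box value, and pooling the prediction tokens for the confidence score, are likewise assumptions made only for illustration.

```python
import math
import random

random.seed(0)
DIM = 8        # token embedding size (illustrative assumption)
NUM_PRED = 4   # four prediction tokens, as stated in the abstract


def rand_matrix(rows, cols, scale=0.1):
    """Random (rows x cols) matrix standing in for learned weights."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def linear(tokens, weight):
    """Multiply every token (length d_in) by a (d_in x d_out) weight matrix."""
    d_out = len(weight[0])
    return [[sum(t[i] * weight[i][j] for i in range(len(t))) for j in range(d_out)]
            for t in tokens]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mixed_attention(tokens, wq, wk, wv):
    """Single-head self-attention over the whole mixed sequence, plus residual."""
    q, k, v = linear(tokens, wq), linear(tokens, wk), linear(tokens, wv)
    scale = math.sqrt(len(q[0]))
    out = []
    for tok, qi in zip(tokens, q):
        weights = softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        attended = [sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(len(tok))]
        out.append([x + y for x, y in zip(tok, attended)])
    return out


def mlp(token, w1, w2):
    """Two-layer MLP head with ReLU, applied to a single token."""
    hidden = [max(0.0, h) for h in linear([token], w1)[0]]
    return linear([hidden], w2)[0]


def make_params(num_layers=2):
    return {
        "pred_tokens": rand_matrix(NUM_PRED, DIM, 1.0),
        "layers": [tuple(rand_matrix(DIM, DIM) for _ in range(3))
                   for _ in range(num_layers)],
        "box_head": (rand_matrix(DIM, DIM), rand_matrix(DIM, 1)),
        "score_head": (rand_matrix(DIM, DIM), rand_matrix(DIM, 1)),
    }


def track_step(template_tokens, search_tokens, params):
    # Concatenate prediction tokens with template and search tokens.
    mixed = params["pred_tokens"] + template_tokens + search_tokens
    # Run the shared backbone on the mixed sequence.
    for wq, wk, wv in params["layers"]:
        mixed = mixed_attention(mixed, wq, wk, wv)
    pred = mixed[:NUM_PRED]  # only the prediction tokens feed the heads
    # Assumption: one normalized box value per prediction token.
    box = [sigmoid(mlp(p, *params["box_head"])[0]) for p in pred]
    # Assumption: confidence is read from the mean of the prediction tokens.
    pooled = [sum(p[d] for p in pred) / NUM_PRED for d in range(DIM)]
    score = sigmoid(mlp(pooled, *params["score_head"])[0])
    return box, score


params = make_params()
template = rand_matrix(4, DIM, 1.0)   # e.g. a 2x2 grid of template patches
search = rand_matrix(9, DIM, 1.0)     # e.g. a 3x3 grid of search patches
box, score = track_step(template, search, params)
```

Because the heads read only the first four output tokens, no dense per-pixel map over the search region is ever produced, which is the property the abstract contrasts with dense convolutional heads.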