Compact Transformer Tracker with Correlative Masked Modeling

Transformer frameworks have shown superior performance in visual object tracking, owing to their strength in aggregating information across the template and search images through the well-known attention mechanism. Most recent advances focus on exploring attention mechanism variants for better information aggregation. We find that these schemes are equivalent to, or even just a subset of, the basic self-attention mechanism. In this paper, we prove that the vanilla self-attention structure is sufficient for information aggregation and that structural adaptation is unnecessary. The key is not the attention structure, but how to extract discriminative features for tracking and enhance the communication between the target and the search image. Based on this finding, we adopt the basic vision transformer (ViT) architecture as our main tracker and concatenate the template and search images for feature embedding. To guide the encoder to capture invariant features for tracking, we attach a lightweight correlative masked decoder that reconstructs the original template and search images from the corresponding masked tokens. The correlative masked decoder serves as a plugin for the compact transformer tracker and is skipped during inference. Our compact tracker uses the simplest structure, consisting only of a ViT backbone and a box head, and runs at 40 fps. Extensive experiments show that the proposed compact transformer tracker outperforms existing approaches, including advanced attention variants, and demonstrates the sufficiency of self-attention in tracking tasks. Our method achieves state-of-the-art performance on five challenging benchmarks: VOT2020, UAV123, LaSOT, TrackingNet, and GOT-10k. Our project is available at https://github.com/HUSTDML/CTTrack.
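
The claim that attention variants are a subset of basic self-attention can be illustrated with a small sketch. This is not the authors' implementation: it assumes single-head attention, identity query/key/value projections, and toy 2-D tokens, and the helper names (`attention`, `softmax`) are hypothetical. It shows that cross-attention from search tokens to template tokens gives the same output as self-attention over the concatenated sequence with a mask restricting keys to the template, while unmasked self-attention additionally covers the remaining template-to-search and within-image interactions.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores (-inf entries get weight 0)."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, mask=None):
    """Single-head scaled dot-product attention on lists of vectors.

    Assumption for illustration: no learned projections (identity Q/K/V).
    mask[i][j] == False forbids query i from attending to key j.
    """
    d = len(keys[0])
    out = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if mask is not None and not mask[i][j]:
                scores.append(-math.inf)
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
        weights = softmax(scores)
        out.append([
            sum(weights[j] * values[j][c] for j in range(len(values)))
            for c in range(len(values[0]))
        ])
    return out


# Toy tokens (hypothetical values): 2 template tokens, 3 search tokens.
template = [[1.0, 0.0], [0.0, 1.0]]
search = [[1.0, 1.0], [0.5, -0.5], [2.0, 0.0]]
tokens = template + search  # concatenation of template and search tokens
n_t = len(template)

# Cross-attention variant: search queries attend only to template keys.
cross = attention(search, template, template)

# The same computation as masked self-attention on the concatenated sequence:
# every query may only attend to the first n_t (template) keys.
mask = [[j < n_t for j in range(len(tokens))] for _ in tokens]
masked_self = attention(tokens, tokens, tokens, mask)

# Unmasked self-attention mixes all four blocks
# (template-template, template-search, search-template, search-search).
full_self = attention(tokens, tokens, tokens)
```

Here the rows of `masked_self` that correspond to search tokens (`masked_self[n_t:]`) match `cross`, so in this toy setting the cross-attention variant is recovered as a restricted case of self-attention on the concatenated input.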
