Recent Transformer-based visual tracking models have shown superior performance. However, prior works are resource-intensive: inefficient training methods demand long GPU training hours, and convolution-based target heads incur high GFLOPs during inference. This heavy resource use makes them unsuitable for real-world applications. In this paper, we present DETRack, a streamlined end-to-end visual object tracking framework. Our framework uses an efficient encoder-decoder structure in which a deformable transformer decoder, acting as the target head, achieves higher sparsity than traditional convolution heads and therefore lower GFLOPs. For training, we introduce a novel one-to-many label assignment and an auxiliary denoising technique, which significantly accelerate the model's convergence. Comprehensive experiments confirm the effectiveness and efficiency of the proposed method. For instance, DETRack achieves 72.9% average overlap (AO) on the challenging GOT-10k benchmark using only 20% of the training epochs required by the baseline, and runs with lower GFLOPs than all the transformer-based trackers.
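To make the idea of one-to-many label assignment concrete, the following is a minimal sketch, not DETRack's actual implementation. It assumes a single ground-truth box per frame and assumes that the positives are the `k` decoder predictions with the highest IoU against that box; the selection rule, the function names `iou` and `one_to_many_assign`, and the parameters `k` and `min_iou` are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Boxes are (x1, y1, x2, y2) in pixel coordinates (an assumed convention).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def one_to_many_assign(
    pred_boxes: List[Box], gt_box: Box, k: int = 3, min_iou: float = 0.0
) -> List[int]:
    """Return indices of the predictions treated as positives.

    Hypothetical rule: keep the k predictions with the highest IoU
    against the single ground-truth box, dropping any whose IoU is
    not strictly above min_iou. One-to-one matching would be k = 1.
    """
    scored = [(iou(p, gt_box), i) for i, p in enumerate(pred_boxes)]
    # Sort by descending IoU; break ties by ascending index.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return sorted(i for score, i in scored[:k] if score > min_iou)


# Usage: four query predictions, one target box.
preds = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60), (0, 0, 9, 9)]
target = (0, 0, 10, 10)
positives = one_to_many_assign(preds, target, k=2)
```

Under this assumed rule, several queries receive a supervised target in each training step instead of one, which is the general intuition for why one-to-many assignment can speed up convergence.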