Compared with previous two-stream trackers, the recent one-stream tracking pipeline, which allows earlier interaction between the template and the search region, has achieved a remarkable performance gain. However, existing one-stream trackers let the template interact with every part of the search region in all encoder layers. This can lead to target-background confusion when the extracted feature representations are not sufficiently discriminative. To alleviate this issue, we propose a generalized relation modeling method based on adaptive token division. The proposed method is a generalized formulation of attention-based relation modeling for Transformer tracking: it inherits the merits of both the two-stream and one-stream pipelines while enabling more flexible relation modeling by selecting appropriate search tokens to interact with the template tokens. An attention masking strategy and the Gumbel-Softmax technique are introduced to facilitate parallel computation and end-to-end learning of the token division module. Extensive experiments show that our method is superior to both the two-stream and one-stream pipelines and achieves state-of-the-art performance on six challenging benchmarks at real-time running speed.
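As a rough illustration of how the token division and attention masking described in the abstract might fit together, the following is a minimal pure-Python sketch, not the authors' implementation. The function names (`gumbel_softmax_hard`, `build_mask`, `masked_attention`), the two-class logit layout, and the exact masking rule (template tokens and unselected search tokens are hidden from each other) are assumptions made for illustration. Only the forward pass is shown; end-to-end learning would additionally require a differentiable framework and a straight-through gradient for the hard decision.

```python
import math
import random


def gumbel_softmax_hard(logits, tau=1.0, rng=random):
    """Forward pass of a hard Gumbel-Softmax sample.

    Returns (one_hot, soft_probs). Assumption: gradients are not modelled here.
    """
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    idx = soft.index(max(soft))
    hard = [1.0 if i == idx else 0.0 for i in range(len(soft))]
    return hard, soft


def build_mask(n_template, selected):
    """Build allowed[q][k]: True when query token q may attend to key token k.

    Token order is template tokens first, then search tokens.
    Assumed rule: template tokens and unselected search tokens do not see
    each other; every other pair is allowed.
    """
    n = n_template + len(selected)
    allowed = [[True] * n for _ in range(n)]
    for j, keep in enumerate(selected):
        if not keep:
            s = n_template + j
            for t in range(n_template):
                allowed[t][s] = False
                allowed[s][t] = False
    return allowed


def masked_attention(q, k, v, allowed):
    """Scaled dot-product attention over all tokens at once.

    Disallowed pairs receive a score of -inf and therefore zero weight,
    so one pass handles every token group in parallel.
    Returns (outputs, weights).
    """
    d = len(q[0])
    outputs, weights = [], []
    for i, qi in enumerate(q):
        scores = [
            sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
            if allowed[i][j] else -math.inf
            for j, kj in enumerate(k)
        ]
        m = max(scores)  # finite: every token may attend to itself
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        outputs.append(
            [sum(w * vj[c] for w, vj in zip(row, v)) for c in range(len(v[0]))]
        )
    return outputs, weights
```

Under the assumed masking rule, selecting every search token yields full attention between template and search region (the one-stream case), while selecting none keeps the two groups separate (the two-stream case). In a sketch like this, each search token would get its own pair of logits, and `gumbel_softmax_hard` would decide whether it is marked as selected before `build_mask` is called.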