Visual tracking aims to automatically estimate the state of a target object in a video sequence, which is especially challenging in dynamic scenarios. Numerous methods have therefore been proposed to introduce temporal cues that enhance tracking robustness. However, conventional CNN and Transformer architectures exhibit inherent limitations in modeling long-range temporal dependencies for visual tracking, often requiring either complex customized modules or substantial computational cost to integrate temporal cues. Inspired by the success of the state space model, we propose a novel temporal modeling paradigm for visual tracking, termed the State-aware Mamba Tracker (SMTrack), which provides a neat pipeline for training and tracking that builds long-range temporal dependencies without customized modules or substantial computational cost. It enjoys several merits. First, we propose a novel selective state-aware state space model with state-wise parameters to capture more diverse temporal cues for robust tracking. Second, SMTrack facilitates long-range temporal interactions with linear computational complexity during training. Third, SMTrack enables each frame to interact with previously tracked frames via hidden state propagation and updating, which reduces the computational cost of handling temporal cues during tracking. Extensive experimental results demonstrate that SMTrack achieves promising performance with low computational cost.
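To make the abstract's core mechanism concrete, the sketch below illustrates a generic selective state space recurrence of the kind SMTrack builds on: each step updates a hidden state in linear time over the sequence, with input-dependent (selective) parameters, and the final hidden state can be propagated to later frames so that tracking does not need to reprocess past frames. All names, dimensions, and projection matrices here are hypothetical placeholders, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_state, T = 4, 8, 6  # hypothetical channel / state / sequence sizes

# Diagonal, stable state matrix (negative continuous-time decays per state).
A = -np.exp(rng.normal(size=(d_model, d_state)))

def selective_scan(xs, dt_proj, B_proj, C_proj, h0=None):
    """Linear-time recurrence: h_t = Abar_t * h_{t-1} + Bbar_t * x_t,
    y_t = <C_t, h_t>. B_t, C_t, dt depend on the input (selectivity)."""
    h = np.zeros((d_model, d_state)) if h0 is None else h0
    ys = []
    for x in xs:                               # one pass over the frames
        dt = np.log1p(np.exp(dt_proj @ x))     # softplus step size, > 0
        B_t, C_t = B_proj @ x, C_proj @ x      # input-dependent parameters
        Abar = np.exp(dt[:, None] * A)         # discretized transition in (0, 1)
        h = Abar * h + (dt[:, None] * B_t[None, :]) * x[:, None]
        ys.append((h * C_t[None, :]).sum(-1))  # read out from the hidden state
    # The final h summarizes all past frames; passing it as h0 to the next
    # clip is the "hidden state propagation" that avoids reprocessing history.
    return np.stack(ys), h

xs = rng.normal(size=(T, d_model))
dt_proj = rng.normal(size=(d_model, d_model))
B_proj = rng.normal(size=(d_state, d_model))
C_proj = rng.normal(size=(d_state, d_model))
ys, h_final = selective_scan(xs, dt_proj, B_proj, C_proj)
```

Because each frame touches the hidden state once, cost grows linearly with sequence length, matching the abstract's claim of linear-complexity temporal interaction during training and cheap per-frame updates during tracking.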