Empowered by transformer-based models, visual tracking has advanced significantly. However, the slow speed of current trackers limits their applicability on devices with constrained computational resources. To address this challenge, we introduce ABTrack, an adaptive computation framework that adaptively bypasses transformer blocks for efficient visual tracking. The rationale behind ABTrack is rooted in the observation that semantic features or relations do not uniformly impact the tracking task across all abstraction levels. Instead, this impact varies with the characteristics of the target and the scene it occupies. Consequently, disregarding insignificant semantic features or relations at certain abstraction levels may not significantly affect tracking accuracy. We propose a Bypass Decision Module (BDM) to determine whether a transformer block should be bypassed, which adaptively simplifies the architecture of ViTs and thus speeds up inference. To counteract the time cost incurred by the BDMs and further enhance the efficiency of ViTs, we introduce a novel ViT pruning method that reduces the dimension of the latent token representations in each transformer block. Extensive experiments on multiple tracking benchmarks validate the effectiveness and generality of the proposed method and show that it achieves state-of-the-art performance. Code is released at: https://github.com/xyyang317/ABTrack.
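The block-bypassing idea can be illustrated with a minimal, framework-free sketch. The scoring rule, threshold, and toy block below are illustrative assumptions only, not the paper's learned Bypass Decision Module, which makes its decision from the tokens at each abstraction level:

```python
def block(tokens):
    """Stand-in for a transformer block: a toy elementwise transform."""
    return [t * 2.0 for t in tokens]

def bypass_decision(tokens, threshold=0.5):
    """Toy BDM stand-in: bypass the block when the mean token magnitude
    is small, i.e. when this abstraction level is deemed insignificant.
    (Hypothetical heuristic; the paper learns this decision.)"""
    score = sum(abs(t) for t in tokens) / len(tokens)
    return score < threshold

def forward(tokens, num_blocks=4, threshold=0.5):
    """Run a stack of blocks, adaptively skipping those the BDM rejects."""
    for _ in range(num_blocks):
        if bypass_decision(tokens, threshold):
            continue  # identity shortcut: the block's compute is saved
        tokens = block(tokens)
    return tokens
```

For example, `forward([1.0])` applies all four blocks, while `forward([0.1, 0.2])` bypasses every block and returns the input unchanged, which is the source of the speedup.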