The speed-precision trade-off is a critical problem for visual object tracking, which usually requires low latency and deployment on resource-constrained devices. Existing solutions for efficient tracking mainly focus on adopting lightweight backbones or modules, which nevertheless sacrifice precision. In this paper, inspired by dynamic network routing, we propose DyTrack, a dynamic transformer framework for efficient tracking. Real-world tracking scenarios exhibit diverse levels of complexity. We argue that a simple network is sufficient for easy frames in video sequences, while more computation should be allocated to difficult ones. DyTrack automatically learns to configure proper reasoning routes for various inputs, making better use of the available computational budget and thus achieving higher performance at the same running speed. We formulate instance-specific tracking as a sequential decision problem and attach terminating branches to intermediate layers of the model. In particular, to make full use of intermediate computation, we introduce a feature recycling mechanism that reuses the outputs of preceding layers. Furthermore, a target-aware self-distillation strategy is designed to enhance the discriminative capability of early predictions by effectively mimicking the representation pattern of the deep model. Extensive experiments on multiple benchmarks demonstrate that DyTrack achieves promising speed-precision trade-offs with only a single model. For instance, DyTrack obtains 64.9% AUC on LaSOT at a speed of 256 fps.
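The general early-exit mechanism behind the terminating branches can be illustrated with a minimal sketch. This is not the paper's actual architecture: the layer and head functions below are toy stand-ins for transformer blocks and prediction branches, and the confidence-threshold halting rule is an assumption about how exit decisions are typically made in early-exit networks.

```python
# Hedged sketch of confidence-based early exit with "terminating branches"
# attached to intermediate layers. Each layer refines a shared feature that
# later layers reuse (the abstract's feature recycling idea, simplified).

def early_exit_forward(x, layers, heads, threshold):
    """Run layers sequentially; after each one, a lightweight head scores
    the current feature. Stop as soon as confidence clears the threshold,
    so easy inputs exit early and hard inputs use the full depth."""
    for depth, (layer, head) in enumerate(zip(layers, heads), start=1):
        x = layer(x)            # refined feature is passed on and reused
        conf, pred = head(x)    # terminating branch: confidence + prediction
        if conf >= threshold:
            return pred, depth  # easy input: terminate early
    return pred, depth          # hard input: full-depth prediction

# Toy instantiation: "features" are floats, each layer adds evidence,
# and the head's confidence grows with the accumulated evidence.
layers = [lambda x: x + 0.3 for _ in range(4)]
heads = [lambda x: (min(x, 1.0), round(x, 2)) for _ in range(4)]

easy_pred, easy_depth = early_exit_forward(0.8, layers, heads, threshold=0.95)
hard_pred, hard_depth = early_exit_forward(0.0, layers, heads, threshold=0.95)
print(easy_depth, hard_depth)  # the easy input exits at a shallower depth
```

The single model thus spends its compute budget adaptively: the same weights serve both fast shallow routes and accurate deep routes, which is what lets DyTrack trade speed for precision without training separate networks.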