Transformer-based single-object trackers achieve state-of-the-art accuracy but rely on fixed-depth inference, executing the full encoder--decoder stack for every frame regardless of visual complexity and thereby incurring unnecessary computational cost in long video sequences dominated by temporally coherent frames. We propose UncL-STARK, an architecture-preserving approach that enables dynamic, uncertainty-aware depth adaptation in transformer-based trackers without modifying the underlying network or adding auxiliary heads. The model is fine-tuned with random-depth training and knowledge distillation so that it retains predictive robustness at multiple intermediate depths, enabling safe inference-time truncation. At runtime, we derive a lightweight uncertainty estimate directly from the model's corner localization heatmaps and use it in a feedback-driven policy that, exploiting the temporal coherence of video, selects the encoder and decoder depth for the next frame based on prediction confidence. Extensive experiments on GOT-10k and LaSOT demonstrate up to 12\% GFLOPs reduction, 8.9\% latency reduction, and 10.8\% energy savings while keeping tracking accuracy within 0.2\% of the full-depth baseline across both short-term and long-term sequences.
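The uncertainty-driven depth selection described above can be illustrated with a minimal sketch. The entropy-based confidence measure, the candidate depths, and the threshold below are illustrative assumptions for exposition, not the paper's exact formulation:

```python
import numpy as np

def heatmap_uncertainty(heatmap):
    """Shannon entropy of a softmax-normalized corner heatmap.

    A peaked heatmap (confident localization) yields low entropy;
    a diffuse heatmap yields high entropy.
    """
    logits = heatmap.flatten()
    p = np.exp(logits - logits.max())   # numerically stable softmax
    p = p / p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def select_depth(uncertainty, depths=(3, 6), threshold=2.0):
    """Feedback policy: choose the encoder/decoder depth for the
    NEXT frame from the current frame's uncertainty.

    `depths` and `threshold` are hypothetical values: a shallow
    depth when confident, the full stack when uncertain.
    """
    return depths[0] if uncertainty < threshold else depths[-1]
```

For example, a sharply peaked 16×16 heatmap gives near-zero entropy and selects the shallow depth, while a uniform heatmap gives entropy near log(256) ≈ 5.5 and falls back to the full stack. The actual method tracks this signal frame-to-frame, relying on temporal coherence to make the previous frame's confidence predictive of the next frame's difficulty.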