Visual object tracking is essential to intelligent robots. Most existing approaches ignore online latency, which can cause severe performance degradation in real-world processing. The issue can be fatal for unmanned aerial vehicles (UAVs), where robust tracking is more challenging and onboard computation is limited. In this work, we present a simple framework for end-to-end latency-aware tracking, i.e., end-to-end predictive visual tracking (PVT++). Unlike existing solutions that naively append Kalman filters after trackers, PVT++ can be jointly optimized, so that it exploits not only motion information but also the rich visual knowledge in most pre-trained tracker models for robust prediction. In addition, to bridge the training-evaluation domain gap, we propose a relative motion factor that enables PVT++ to generalize to challenging and complex UAV tracking scenes. These careful designs make the small-capacity, lightweight PVT++ a widely effective solution. This work also presents an extended latency-aware evaluation benchmark for assessing trackers of any speed in the online setting. Empirical results on a robotic platform from the aerial perspective show that PVT++ achieves significant performance gains on various trackers and higher accuracy than prior solutions, largely mitigating the degradation caused by latency.
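To make the online latency setting concrete, the following minimal sketch simulates which tracker output is actually available when each frame arrives. It is an illustration under stated assumptions, not the benchmark protocol of this work: it assumes a constant per-frame latency, a tracker that always picks the newest arrived frame once it is free, and the hypothetical function names `simulate_online_outputs` and `match_latest_output`.

```python
from bisect import bisect_right


def simulate_online_outputs(num_frames, fps, latency):
    """Simulate a tracker with constant latency (an assumption of this sketch).

    Returns a list of (processed_frame_index, finish_time) pairs.
    """
    outputs = []
    t = 0.0  # time at which the tracker is free to start a new frame
    while True:
        # Newest frame that has arrived by time t (small epsilon guards rounding).
        frame = int(t * fps + 1e-9)
        if frame >= num_frames:
            break
        finish = t + latency
        outputs.append((frame, finish))
        # Start again when free, but never re-process the same frame.
        t = max(finish, (frame + 1) / fps)
    return outputs


def match_latest_output(outputs, num_frames, fps):
    """For each frame, return the index of the frame whose result is the
    latest one finished by that frame's arrival time (None if no result yet)."""
    finish_times = [finish for _, finish in outputs]
    matched = []
    for i in range(num_frames):
        arrival = i / fps
        k = bisect_right(finish_times, arrival) - 1
        matched.append(outputs[k][0] if k >= 0 else None)
    return matched


# A 30 FPS stream and a tracker that needs 40 ms per frame.
outputs = simulate_online_outputs(num_frames=6, fps=30, latency=0.04)
matched = match_latest_output(outputs, num_frames=6, fps=30)
print(matched)  # [None, None, 0, 1, 2, 3]
```

In this toy setting every frame is evaluated against a result that is two frames old, which is the kind of mismatch a predictive tracker is meant to compensate for.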