Predicting the Best of N Visual Trackers

We observe that the performance of state-of-the-art (SOTA) visual trackers varies surprisingly strongly across video attributes and datasets. No single tracker remains the best performer across all tracking attributes and datasets. To bridge this gap, for a given video sequence we predict the "Best of the N Trackers", which we call the BofN meta-tracker. At its core, a Tracking Performance Prediction Network (TP2N) selects the tracker predicted to perform best on the given video sequence, using only a few initial frames. We also introduce a frame-level BofN meta-tracker, which re-predicts the best performer at regular temporal intervals. The TP2N is based on the self-supervised learning architectures MoCo v2, SwAV, BT, and DINO; experiments show that DINO with a ViT-S backbone performs best. The video-level BofN meta-tracker outperforms existing SOTA trackers by a large margin on nine standard benchmarks: LaSOT, TrackingNet, GOT-10k, VOT2019, VOT2021, VOT2022, UAV123, OTB100, and WebUAV-3M. The frame-level BofN meta-tracker improves results further by handling variations in tracking scenarios within long sequences. For instance, on GOT-10k the BofN meta-tracker reaches an average overlap (AO) of 88.7% in the video-level setting and 91.1% in the frame-level setting, whereas the best performing individual tracker, RTS, achieves 85.20% AO. On VOT2022, the BofN expected average overlap (EAO) is 67.88% and 70.98% in the video-level and frame-level settings respectively, compared to 64.12% for the best performing individual tracker, ARTrack. This work also presents an extensive evaluation of competitive tracking methods on all commonly used benchmarks, following their protocols. The code, the trained models, and the results will soon be made publicly available at https://github.com/BasitAlawode/Best_of_N_Trackers.
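The selection logic described above can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_scorer` is a hypothetical stand-in for TP2N, the tracker names are placeholders, and the values of `n_init` and `interval` are assumptions, since the abstract does not state how many initial frames are used or how long the re-prediction interval is. The sketch only shows the difference between choosing one tracker per video and re-choosing at regular intervals.

```python
from typing import Callable, Dict, List, Sequence

Frame = Sequence[float]  # stand-in for an image; a real system would use pixel arrays
Scorer = Callable[[Sequence[Frame]], Dict[str, float]]  # frames -> predicted score per tracker


def select_tracker(frames: Sequence[Frame], scorer: Scorer) -> str:
    """Return the name of the tracker with the highest predicted score."""
    scores = scorer(frames)
    return max(scores, key=scores.get)


def video_level_bofn(video: Sequence[Frame], scorer: Scorer, n_init: int = 3) -> List[str]:
    """Pick one tracker from the first n_init frames and keep it for the whole video."""
    best = select_tracker(video[:n_init], scorer)
    return [best] * len(video)


def frame_level_bofn(video: Sequence[Frame], scorer: Scorer,
                     n_init: int = 3, interval: int = 5) -> List[str]:
    """Re-predict the best tracker every `interval` frames (assumed schedule)."""
    choices: List[str] = []
    for start in range(0, len(video), interval):
        # Score the first n_init frames of each interval.
        best = select_tracker(video[start:start + n_init], scorer)
        choices.extend([best] * min(interval, len(video) - start))
    return choices


def toy_scorer(frames: Sequence[Frame]) -> Dict[str, float]:
    """Hypothetical scorer: favours tracker_a on bright frames, tracker_b on dark ones."""
    mean = sum(sum(f) / len(f) for f in frames) / len(frames)
    return {"tracker_a": mean, "tracker_b": 1.0 - mean}


# Toy video: five bright frames followed by five dark frames.
video = [[0.9]] * 5 + [[0.1]] * 5

video_choice = video_level_bofn(video, toy_scorer)
frame_choice = frame_level_bofn(video, toy_scorer, interval=5)
```

On this toy video the video-level variant commits to `tracker_a` for all ten frames, while the frame-level variant switches to `tracker_b` once the scene changes, which mirrors the motivation given for the frame-level setting on long sequences.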
