VastTrack: Vast Category Visual Object Tracking

In this paper, we introduce a novel benchmark, dubbed VastTrack, to facilitate the development of more general visual tracking by encompassing abundant classes and videos. VastTrack possesses several attractive properties: (1) Vast Object Category. It covers target objects from 2,115 classes, largely surpassing the object categories of existing popular benchmarks (e.g., GOT-10k with 563 classes and LaSOT with 70 categories). With such a vast range of object classes, we expect to learn more general object tracking. (2) Larger Scale. VastTrack offers 50,610 sequences with 4.2 million frames, which makes it the largest benchmark to date in terms of the number of videos, and it could thus benefit the training of even more powerful visual trackers in the deep learning era. (3) Rich Annotation. Besides conventional bounding box annotations, VastTrack also provides linguistic descriptions for the videos. These rich annotations enable the development of both vision-only and vision-language tracking. To ensure precise annotation, all videos are manually labeled with multiple rounds of careful inspection and refinement. To understand the performance of existing trackers and to provide baselines for future comparison, we extensively assess 25 representative trackers. Unsurprisingly, their performance drops significantly relative to results on current datasets, owing to the lack of abundant categories and videos from diverse scenarios for training, and more effort is required to improve general tracking. Our VastTrack and all the evaluation results will be made publicly available at https://github.com/HengLan/VastTrack.
