Generic Object Tracking (GOT) is the problem of tracking target objects specified by bounding boxes in the first frame of a video. While the task has received much attention over the last decades, researchers have focused almost exclusively on the single-object setting. Multi-object GOT has wider applicability, making it more attractive for real-world applications. We attribute the lack of research interest in this problem to the absence of suitable benchmarks. In this work, we introduce a new large-scale GOT benchmark, LaGOT, containing multiple annotated target objects per sequence. Our benchmark allows users to tackle key remaining challenges in GOT, aiming to increase robustness and reduce computation through joint tracking of multiple objects. In addition, we propose a transformer-based GOT tracker baseline capable of jointly processing multiple objects through shared computation. With 10 concurrent objects, our approach runs 4x faster than tracking each object independently, and it outperforms existing single-object trackers on our new benchmark. Our approach also achieves highly competitive results on single-object GOT datasets, setting a new state of the art on TrackingNet with a success rate AUC of 84.4%. Our benchmark, code, and trained models will be made publicly available.