More efficient annotation of object trajectories in video could enable the next generation of data-hungry tracking algorithms to thrive on large-scale datasets. Despite the importance of this task, very few works explore how to label tracking datasets both efficiently and comprehensively. In this work, we introduce SPAM, a video label engine that provides high-quality labels with minimal human intervention. SPAM is built around two key insights: i) most tracking scenarios can be easily resolved, so we use a pre-trained model to generate high-quality pseudo-labels and reserve human involvement for a smaller subset of more difficult instances; ii) the spatiotemporal dependencies of track annotations can be formulated elegantly and efficiently through graphs, so we use a unified graph formulation to address the annotation of both detections and identity association for tracks across time. Based on these insights, SPAM produces high-quality annotations at a fraction of the ground-truth labeling cost. We demonstrate that trackers trained on SPAM labels achieve performance comparable to those trained on human annotations while requiring only $3\%$ to $20\%$ of the human labeling effort. Hence, SPAM paves the way towards highly efficient labeling of large-scale tracking datasets. We release all models and code.
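To make the two insights concrete, the following is a minimal illustrative sketch, not SPAM's actual implementation. It assumes that candidate identity associations are edges of a graph whose nodes are detections, and that hard cases are selected by model confidence; the abstract does not specify the selection rule. The names `Edge` and `triage`, the `margin` threshold, and the toy probabilities are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Candidate association between two detections (hypothetical structure)."""
    src: str     # detection id in frame t
    dst: str     # detection id in frame t+1
    prob: float  # assumed model probability that src and dst share an identity


def triage(edges, margin=0.1):
    """Split edges into pseudo-labeled and human-review sets.

    Assumption: an edge counts as "easy" when the model probability lies
    within `margin` of 0 or 1; everything else is sent to an annotator.
    """
    auto, review = {}, []
    for e in edges:
        if e.prob >= 1.0 - margin:
            auto[(e.src, e.dst)] = True    # confident match: pseudo-label
        elif e.prob <= margin:
            auto[(e.src, e.dst)] = False   # confident non-match: pseudo-label
        else:
            review.append(e)               # ambiguous: needs a human label
    return auto, review


# Toy graph: two detections in frame t0, two in frame t1 (made-up scores).
edges = [
    Edge("t0_a", "t1_a", 0.98),
    Edge("t0_a", "t1_b", 0.03),
    Edge("t0_b", "t1_b", 0.55),
    Edge("t0_b", "t1_a", 0.02),
]
auto_labels, needs_human = triage(edges)
human_fraction = len(needs_human) / len(edges)
```

In this toy graph only the ambiguous edge between `t0_b` and `t1_b` would be routed to an annotator. The resulting fraction reflects the made-up scores and says nothing about the reported labeling effort.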