Unsupervised object-centric learning methods partition scenes into entities without additional localization information, which makes them excellent candidates for reducing the annotation burden of multiple-object tracking (MOT) pipelines. Unfortunately, they lack two key properties: objects are often split into parts, and they are not tracked consistently over time. In fact, state-of-the-art models achieve pixel-level accuracy and temporal consistency by relying on supervised object detection with additional ID labels for association over time. This paper proposes a video object-centric model for MOT. It consists of an index-merge module that adapts the object-centric slots into detection outputs and an object memory module that builds complete object prototypes to handle occlusions. Benefiting from object-centric learning, we require only sparse detection labels (0%-6.25%) for object localization and feature binding. Because object association relies on our self-supervised, Expectation-Maximization-inspired loss, our approach requires no ID labels. Our experiments show that the approach significantly narrows the gap between the existing object-centric model and the fully supervised state of the art, and that it outperforms several unsupervised trackers.