Multi-object tracking has a heavy-tailed difficulty distribution: most frames are easy for a lightweight base tracker, while a small fraction are intrinsically hard. Video object segmentation (VOS) models can often preserve identity through the hard frames where the base tracker fails, but they are much more expensive in compute and memory. We propose selective mask propagation, a tracking algorithm that dispatches from a base tracker to a VOS model only on windows where an assignment-uncertainty signal fires. The base tracker's output is modified only when the VOS model makes a confident prediction that contradicts the base tracker's identity assignment; weak or inconclusive predictions preserve the base output. The method is training-free, treats both the base tracker and the VOS model as black boxes, and can benefit from replacing the VOS component with a more capable model. On DanceTrack, selective mask propagation improves three different base trackers. On SportsMOT, where identity preservation is central to sports analytics, SAM3-Deep-EIoU with global track association achieves state-of-the-art performance on the benchmark with 86.8 HOTA.
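The dispatch-and-override rule described above can be illustrated with a minimal sketch. This is not the authors' implementation: the per-frame uncertainty scores, the two thresholds, the `VosPrediction` type, and the `run_vos` callback are all hypothetical stand-ins, since the abstract does not specify how the uncertainty signal or the VOS confidence is computed. The sketch only shows the control flow: the VOS model runs on flagged windows alone, and a base assignment changes only when the VOS prediction is both confident and contradictory.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical representation: one dict per frame, detection index -> track id.
Assignment = Dict[int, int]


@dataclass
class VosPrediction:
    """Hypothetical VOS output for one detection: an identity and a confidence."""
    track_id: int
    confidence: float


def find_uncertain_windows(
    uncertainty: Sequence[float], threshold: float, pad: int = 0
) -> List[Tuple[int, int]]:
    """Return merged [start, end) windows around frames whose score exceeds threshold."""
    windows: List[Tuple[int, int]] = []
    n = len(uncertainty)
    for t, score in enumerate(uncertainty):
        if score <= threshold:
            continue  # easy frame: the base tracker is trusted as-is
        start, end = max(0, t - pad), min(n, t + pad + 1)
        if windows and start <= windows[-1][1]:
            # Overlapping or adjacent to the previous window: extend it.
            windows[-1] = (windows[-1][0], max(windows[-1][1], end))
        else:
            windows.append((start, end))
    return windows


def selective_override(
    base: Sequence[Assignment],
    uncertainty: Sequence[float],
    run_vos: Callable[[int, int], List[Dict[int, VosPrediction]]],
    threshold: float = 0.5,        # assumed value, not from the paper
    min_confidence: float = 0.9,   # assumed value, not from the paper
) -> List[Assignment]:
    """Refine base assignments using VOS predictions on uncertain windows only."""
    refined = [dict(frame) for frame in base]  # copy; base output stays untouched
    for start, end in find_uncertain_windows(uncertainty, threshold):
        # The expensive model is treated as a black box and called per window.
        vos_frames = run_vos(start, end)
        for offset, preds in enumerate(vos_frames):
            t = start + offset
            for det, pred in preds.items():
                if det not in refined[t]:
                    continue
                confident = pred.confidence >= min_confidence
                contradicts = pred.track_id != refined[t][det]
                if confident and contradicts:
                    refined[t][det] = pred.track_id
                # Weak or agreeing predictions leave the base output in place.
    return refined
```

Because `run_vos` is passed in as a callable, swapping in a more capable VOS model requires no change to the dispatch logic, which mirrors the black-box framing of the method.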