Tracking arbitrary objects through space and time is a goal shared by Visual Object Tracking (VOT) and Video Object Segmentation (VOS). Some studies have attempted joint tracking and segmentation, but they often lack full compatibility with both boxes and masks in initialization and prediction, and they mainly focus on single-object scenarios. To address these limitations, this paper proposes a Multi-object Mask-box Integrated framework for unified Tracking and Segmentation, dubbed MITS. First, a unified identification module is proposed to support both box and mask references for initialization: detailed object information is inferred from boxes or directly retained from masks. Additionally, a novel pinpoint box predictor is proposed for accurate multi-object box prediction, facilitating target-oriented representation learning. All target objects are processed simultaneously from encoding to propagation and decoding, forming a unified pipeline for VOT and VOS. Experimental results show that MITS achieves state-of-the-art performance on both VOT and VOS benchmarks. Notably, MITS surpasses the best prior VOT competitor by around 6% on the GOT-10k test set and significantly improves the performance of box initialization on VOS benchmarks. The code is available at https://github.com/yoxu515/MITS.
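To make the two reference forms concrete, the following is a minimal sketch of how multi-object box and mask annotations relate to each other. It is not the MITS identification module or box predictor, which are learned components; it only shows a naive geometric conversion. The function names, the `(x0, y0, x1, y1)` box convention with exclusive end coordinates, and the use of label 0 for background are assumptions made for illustration.

```python
import numpy as np


def boxes_to_label_map(boxes, height, width):
    """Rasterize {object_id: (x0, y0, x1, y1)} into an integer label map.

    Assumed convention: x1 and y1 are exclusive, 0 means background.
    If boxes overlap, later entries overwrite earlier ones.
    """
    label_map = np.zeros((height, width), dtype=np.int32)
    for obj_id, (x0, y0, x1, y1) in boxes.items():
        label_map[y0:y1, x0:x1] = obj_id
    return label_map


def label_map_to_boxes(label_map):
    """Return the tight box of every non-background object in a label map."""
    boxes = {}
    for obj_id in np.unique(label_map):
        if obj_id == 0:
            continue  # skip background
        ys, xs = np.nonzero(label_map == obj_id)
        boxes[int(obj_id)] = (
            int(xs.min()),
            int(ys.min()),
            int(xs.max()) + 1,  # exclusive end
            int(ys.max()) + 1,
        )
    return boxes


# Two non-overlapping objects in a 10 x 12 frame (hypothetical values).
reference_boxes = {1: (2, 1, 5, 4), 2: (6, 3, 9, 8)}
coarse_masks = boxes_to_label_map(reference_boxes, height=10, width=12)
recovered_boxes = label_map_to_boxes(coarse_masks)
```

A rectangle rasterized from a box is only a coarse stand-in for a true object mask, which is why the abstract describes object details as being inferred from boxes but directly retained from masks.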