Visual Object Tracking (VOT) and Video Object Segmentation (VOS) share a common goal: tracking arbitrary given objects through space and time. Some studies have attempted joint tracking and segmentation, but they often lack full compatibility between box and mask in both initialization and prediction, and they mainly focus on single-object scenarios. To address these limitations, this paper proposes a Multi-object Mask-box Integrated framework for unified Tracking and Segmentation, dubbed MITS. First, a unified identification module is proposed to support both box and mask references for initialization, where detailed object information is inferred from boxes or directly retained from masks. Second, a novel pinpoint box predictor is proposed for accurate multi-object box prediction, facilitating target-oriented representation learning. All target objects are processed simultaneously from encoding through propagation to decoding, forming a unified pipeline for VOT and VOS. Experimental results show that MITS achieves state-of-the-art performance on both VOT and VOS benchmarks. Notably, MITS surpasses the best prior VOT competitor by around 6% on the GOT-10k test set and significantly improves the performance of box initialization on VOS benchmarks. The code is available at https://github.com/yoxu515/MITS.
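To make the box/mask compatibility issue concrete, the following is a minimal sketch of the two naive conversions between the representations. It is not the MITS unified identification module or the pinpoint box predictor; the function names `mask_to_box` and `box_to_mask` and the inclusive `(x_min, y_min, x_max, y_max)` box convention are assumptions chosen for illustration only.

```python
from typing import List, Optional, Tuple

# Assumed convention: inclusive pixel indices (x_min, y_min, x_max, y_max).
Box = Tuple[int, int, int, int]


def mask_to_box(mask: List[List[int]]) -> Optional[Box]:
    """Return the tight axis-aligned box around nonzero pixels, or None if the mask is empty."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    return (cols[0], rows[0], cols[-1], rows[-1])


def box_to_mask(box: Box, height: int, width: int) -> List[List[int]]:
    """Fill the box with ones: a coarse stand-in for a mask that over-covers the object."""
    x0, y0, x1, y1 = box
    return [
        [1 if x0 <= x <= x1 and y0 <= y <= y1 else 0 for x in range(width)]
        for y in range(height)
    ]


# An L-shaped object: the mask carries shape detail that the box cannot.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
box = mask_to_box(mask)
coarse = box_to_mask(box, height=3, width=4)
```

Going from mask to box is lossless with respect to the box, but the reverse direction only yields a filled rectangle that includes background pixels. This asymmetry is one way to read why the abstract describes object detail as "inferred from boxes" yet "directly retained from masks".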