Video tracking aims to locate a specific target in subsequent frames given its initial state. Because the granularity of target states varies across tasks, most existing trackers are tailored to a single task and rely heavily on modules custom-designed for that task, which limits their generalization and leads to redundancy in both model design and parameters. To unify video tracking tasks, we present SAM 2++, a unified model for tracking at any granularity, including masks, boxes, and points. First, to extend target granularity, we design task-specific prompts that encode various task inputs into general prompt embeddings, and a unified decoder that consolidates diverse task results into a unified pre-output form. Next, to support memory matching, the core operation of tracking, we introduce a task-adaptive memory mechanism that unifies memory across different granularities. Finally, we introduce a customized data engine to support tracking training at any granularity, producing a large and diverse video tracking dataset with rich annotations at three granularities, termed Tracking-Any-Granularity, which serves as a comprehensive resource for training and benchmarking unified tracking. Comprehensive experiments on multiple benchmarks confirm that SAM 2++ sets a new state of the art across diverse tracking tasks at different granularities, establishing a unified and robust tracking framework.