Visual object tracking is a fundamental video task in computer vision. Recently, the rapidly increasing power of perception algorithms has made it possible to unify single-/multi-object and box-/mask-based tracking. Among these algorithms, the Segment Anything Model (SAM) has attracted particular attention. In this report, we propose HQTrack, a framework for High Quality Tracking of anything in videos. HQTrack mainly consists of a video multi-object segmenter (VMOS) and a mask refiner (MR). Given the objects to be tracked in the initial frame of a video, VMOS propagates the object masks to the current frame. The masks at this stage are not accurate enough, because VMOS is trained on several closed-set video object segmentation (VOS) datasets and therefore has limited ability to generalize to complex and corner-case scenes. To further improve the quality of the tracking masks, a pretrained MR model is employed to refine the tracking results. As a compelling testament to the effectiveness of our paradigm, HQTrack ranks 2nd in the Visual Object Tracking and Segmentation (VOTS2023) challenge without employing any tricks such as test-time data augmentation or model ensembling. Code and models are available at https://github.com/jiawen-zhu/HQTrack.
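The two-stage flow described above (mask propagation by VMOS, followed by refinement by MR) can be illustrated with a minimal sketch. It is not the HQTrack implementation: `track_video`, `propagate`, and `refine` are hypothetical names introduced here, the two models are stood in for by arbitrary callables, and feeding the unrefined mask back into propagation is an assumption that the abstract does not specify.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")  # any frame representation, e.g. an image array
Mask = TypeVar("Mask")    # any mask representation, e.g. an integer label map


def track_video(
    frames: Sequence[Frame],
    init_mask: Mask,
    propagate: Callable[[Frame, Mask], Mask],  # hypothetical stand-in for VMOS
    refine: Callable[[Frame, Mask], Mask],     # hypothetical stand-in for MR
) -> List[Mask]:
    """Return one mask per frame: the given initial mask, then refined masks."""
    if not frames:
        return []
    results = [init_mask]  # the first frame's mask is provided, not predicted
    prev = init_mask
    for frame in frames[1:]:
        coarse = propagate(frame, prev)        # stage 1: propagate to the current frame
        results.append(refine(frame, coarse))  # stage 2: refine the propagated mask
        prev = coarse  # assumption: propagation continues from the unrefined mask
    return results
```

Because both stages are passed in as callables, the loop can be exercised with toy functions before any real segmenter or refiner is plugged in.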