Despite recent advances in unified image segmentation (IS), developing a unified video segmentation (VS) model remains a challenge. The main reason is that generic category-specified VS tasks must detect all objects and track them across consecutive frames, whereas prompt-guided VS tasks must re-identify the target from visual or text prompts throughout the entire video, which makes it hard to handle both kinds of task with the same architecture. We attempt to address these issues with UniVS, a novel unified VS architecture that uses prompts as queries. UniVS averages the prompt features of the target from previous frames to form its initial query for explicitly decoding masks, and introduces a target-wise prompt cross-attention layer in the mask decoder to integrate the prompt features stored in the memory pool. By taking the predicted masks of entities from previous frames as their visual prompts, UniVS converts different VS tasks into prompt-guided target segmentation, eliminating the heuristic inter-frame matching process. The framework not only unifies the different VS tasks but also naturally supports universal training and testing, ensuring robust performance across different scenarios. UniVS shows a commendable balance between performance and universality on 10 challenging VS benchmarks covering video instance, semantic, panoptic, object, and referring segmentation tasks. Code can be found at \url{https://github.com/MinghanLi/UniVS}.
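To make the "prompts as queries" idea concrete, the following minimal sketch shows one way an initial query could be formed by averaging a target's prompt features over previous frames. It is an illustration under stated assumptions, not the UniVS implementation: the abstract does not say how per-frame prompt features are extracted, so masked average pooling over the predicted mask is assumed here, and the names `masked_average_pool`, `initial_query`, and `memory_pool` are hypothetical.

```python
from typing import List, Optional

Vector = List[float]
FeatureMap = List[List[Vector]]   # H x W x C nested lists
Mask = List[List[int]]            # H x W, entries 0 or 1


def masked_average_pool(features: FeatureMap, mask: Mask) -> Optional[Vector]:
    """Average the feature vectors at pixels where the mask is 1.

    ASSUMPTION: this pooling stands in for however prompt features are
    actually extracted; the abstract does not specify it.
    Returns None if the mask is empty (target not visible in the frame).
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, m in zip(feat_row, mask_row):
            if m:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        return None
    return [t / count for t in total]


def initial_query(memory_pool: List[Vector]) -> Vector:
    """Average the stored per-frame prompt features into one query vector."""
    if not memory_pool:
        raise ValueError("memory pool is empty: no prompt features to average")
    channels = len(memory_pool[0])
    n = len(memory_pool)
    return [sum(v[c] for v in memory_pool) / n for c in range(channels)]


# Two toy 2x2 frames with 2-channel features and the target's predicted masks.
frame1 = [[[1.0, 0.0], [3.0, 2.0]], [[5.0, 5.0], [7.0, 7.0]]]
mask1 = [[1, 1], [0, 0]]
frame2 = [[[0.0, 0.0], [4.0, 6.0]], [[0.0, 0.0], [0.0, 0.0]]]
mask2 = [[0, 1], [0, 0]]

# Hypothetical memory pool: one prompt feature per previous frame,
# skipping frames where the target's mask is empty.
memory_pool = []
for feats, msk in ((frame1, mask1), (frame2, mask2)):
    pooled = masked_average_pool(feats, msk)
    if pooled is not None:
        memory_pool.append(pooled)

query = initial_query(memory_pool)
print(query)  # [3.0, 3.5]
```

In this sketch the predicted mask from each earlier frame acts as the visual prompt for the target, so the same query construction applies whether the first mask came from a detector or from a user-supplied prompt, and no separate inter-frame matching step is involved.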