Despite recent advances in unified image segmentation (IS), developing a unified video segmentation (VS) model remains challenging. The main difficulty is that generic category-specified VS tasks must detect all objects and track them across consecutive frames, whereas prompt-guided VS tasks must re-identify the target given by visual or text prompts throughout the entire video, which makes it hard to handle both kinds of task with the same architecture. To address this, we present UniVS, a novel unified VS architecture that uses prompts as queries. UniVS averages the prompt features of a target from previous frames to form its initial query for explicitly decoding masks, and introduces a target-wise prompt cross-attention layer in the mask decoder to integrate the prompt features stored in the memory pool. By taking the predicted masks of entities from previous frames as their visual prompts, UniVS converts different VS tasks into prompt-guided target segmentation, eliminating the heuristic inter-frame matching process. The framework not only unifies the different VS tasks but also naturally supports universal training and testing, ensuring robust performance across different scenarios. UniVS achieves a commendable balance between performance and universality on 10 challenging VS benchmarks covering video instance, semantic, panoptic, object, and referring segmentation tasks. Code is available at \url{https://github.com/MinghanLi/UniVS}.
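The prompts-as-queries idea described in the abstract can be illustrated with a small NumPy sketch. This is not the UniVS implementation: the function names, the masked average pooling used to turn a predicted mask into a prompt feature, and the single-head attention without learned projections are all assumptions made for illustration. Only the two high-level steps come from the abstract, namely averaging a target's prompt features from previous frames into an initial query, and attending from that query to the target's own prompt features in the memory pool.

```python
import numpy as np


def masked_average_pool(features, mask):
    """Average the feature vectors inside a binary mask (assumed prompt encoder).

    features: (H, W, C) frame feature map; mask: (H, W) predicted mask.
    Returns a (C,) prompt feature for the target in that frame.
    """
    weights = mask.astype(features.dtype)
    total = weights.sum()
    if total == 0:
        # Empty mask: no evidence for the target in this frame.
        return np.zeros(features.shape[-1], dtype=features.dtype)
    return (features * weights[..., None]).sum(axis=(0, 1)) / total


def initial_query(memory_pool):
    """Average the per-frame prompt features of one target into its initial query."""
    return np.mean(np.stack(memory_pool, axis=0), axis=0)


def prompt_cross_attention(query, memory_pool):
    """Target-wise attention: the query attends only to its own prompt features.

    Simplified to one head with no learned projections (an assumption).
    """
    keys = np.stack(memory_pool, axis=0)              # (T, C)
    scores = keys @ query / np.sqrt(query.shape[0])   # (T,)
    scores = scores - scores.max()                    # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()                 # softmax over frames
    return query + weights @ keys                     # residual update


# Toy data: 3 previous frames, 4x4 feature maps with 8 channels.
rng = np.random.default_rng(0)
T, H, W, C = 3, 4, 4, 8
memory_pool = []
for _ in range(T):
    feats = rng.standard_normal((H, W, C))
    mask = np.zeros((H, W), dtype=bool)
    mask[1:3, 1:3] = True                             # hypothetical predicted mask
    memory_pool.append(masked_average_pool(feats, mask))

query = initial_query(memory_pool)
refined = prompt_cross_attention(query, memory_pool)
```

In this sketch the memory pool is kept per target, so no inter-frame matching step is needed: the predicted mask for a target in one frame simply contributes another prompt feature to that target's pool for the next frame.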