Semantic segmentation in videos has been a focal point of recent research, yet existing models struggle when confronted with unfamiliar categories. To address this, we introduce the Open Vocabulary Video Semantic Segmentation (OV-VSS) task, which aims to accurately segment every pixel across a wide range of open-vocabulary categories, including those that are novel or previously unexplored. To advance OV-VSS, we propose a robust baseline, OV2VSS, which integrates a spatial-temporal fusion module that allows the model to exploit temporal relationships across consecutive frames. We additionally incorporate a random frame enhancement module, broadening the model's understanding of semantic context throughout the entire video sequence, and a video text encoding module, which strengthens the model's capability to interpret textual information within the video context. Comprehensive evaluations on benchmark datasets such as VSPW and Cityscapes highlight OV2VSS's zero-shot generalization capabilities, especially in handling novel categories. The results validate OV2VSS's effectiveness, demonstrating improved semantic segmentation performance across diverse video datasets.
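The abstract only names the components; the sketch below is a rough illustration, not the paper's actual architecture. It assumes a PyTorch-style implementation in which spatial-temporal fusion is realized as cross-attention from the current frame's tokens to a reference frame's tokens, and open-vocabulary classification is done by matching pixel features against text embeddings of class names (as in CLIP-style models). All names (`SpatialTemporalFusion`, `open_vocab_logits`) and hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTemporalFusion(nn.Module):
    """Hypothetical fusion block: the current frame's tokens attend to a
    reference (e.g., previous or randomly sampled) frame's tokens."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cur: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        # cur, ref: (B, N, C) flattened spatial tokens of two frames.
        fused, _ = self.attn(query=cur, key=ref, value=ref)
        return self.norm(cur + fused)  # residual keeps current-frame content

def open_vocab_logits(pixel_feats: torch.Tensor,
                      text_embeds: torch.Tensor,
                      tau: float = 0.07) -> torch.Tensor:
    """Score each pixel token against each class-name text embedding via
    cosine similarity, enabling segmentation of unseen categories."""
    pixel_feats = F.normalize(pixel_feats, dim=-1)   # (B, N, C)
    text_embeds = F.normalize(text_embeds, dim=-1)   # (K, C)
    return pixel_feats @ text_embeds.t() / tau       # (B, N, K)

# Toy usage: two consecutive frames, 5 candidate class names.
B, N, C, K = 2, 64, 256, 5
fusion = SpatialTemporalFusion(C)
cur, ref = torch.randn(B, N, C), torch.randn(B, N, C)
logits = open_vocab_logits(fusion(cur, ref), torch.randn(K, C))
print(logits.shape)  # torch.Size([2, 64, 5])
```

Under these assumptions, swapping the reference frame for a randomly sampled one from the same clip would correspond to the random frame enhancement idea: the fusion block sees semantic context from elsewhere in the video rather than only the adjacent frame.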