We present REM, a framework for segmenting a wide range of concepts in video that can be described through natural language. Our method capitalizes on visual-language representations learned by video diffusion models on Internet-scale datasets. A key insight of our approach is to preserve as much of the generative model's original representation as possible while fine-tuning it on narrow-domain Referral Object Segmentation datasets. As a result, our framework can accurately segment and track rare and unseen objects, despite being trained on object masks from a limited set of categories. Additionally, it generalizes to non-object dynamic concepts, such as waves crashing in the ocean, as demonstrated in our newly introduced benchmark for Referral Video Process Segmentation (Ref-VPS). Our experiments show that REM performs on par with state-of-the-art approaches on in-domain datasets such as Ref-DAVIS, while outperforming them by up to twelve points in terms of region similarity on out-of-domain data, leveraging the power of Internet-scale pre-training.
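For context, the region similarity metric referenced above is, as in the DAVIS family of benchmarks, the Jaccard index (intersection-over-union) between predicted and ground-truth masks, averaged over a video's frames. The following is a minimal sketch of how such a score is typically computed, assuming binary per-frame NumPy masks; the function name and interface are illustrative, not part of the REM evaluation code.

```python
import numpy as np

def region_similarity(pred_masks, gt_masks):
    """Mean Jaccard index (J) over a video's frames.

    pred_masks, gt_masks: sequences of binary HxW arrays,
    one predicted/ground-truth pair per frame.
    """
    scores = []
    for pred, gt in zip(pred_masks, gt_masks):
        pred = pred.astype(bool)
        gt = gt.astype(bool)
        union = np.logical_or(pred, gt).sum()
        if union == 0:
            # Both masks empty: treat as perfect agreement.
            scores.append(1.0)
        else:
            inter = np.logical_and(pred, gt).sum()
            scores.append(inter / union)
    return float(np.mean(scores))
```

Under this metric, "twelve points" corresponds to an absolute gap of 0.12 in mean intersection-over-union on the out-of-domain data.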