Despite advances in user-guided video segmentation, consistently extracting complex objects from highly complex scenes remains labor-intensive, especially in production settings, where it is not uncommon for a majority of frames to require annotation. We introduce XMem++, a novel semi-supervised video object segmentation (SSVOS) model that improves existing memory-based models with a permanent memory module. Whereas most existing methods focus on single-frame annotations, our approach can effectively handle multiple user-selected frames showing varying appearances of the same object or region. Our method extracts highly consistent results while keeping the required number of frame annotations low. We further introduce an iterative, attention-based frame suggestion mechanism that computes the next best frame for annotation. Our method runs in real time and does not require retraining after each user input. We also introduce a new dataset, PUMaVOS, which covers challenging use cases not found in previous benchmarks. We demonstrate state-of-the-art (SOTA) performance on challenging (partial and multi-class) segmentation scenarios as well as on long videos, while requiring significantly fewer frame annotations than any existing method.