In this paper, we introduce a variant of video object segmentation (VOS) that bridges interactive and semi-automatic approaches, termed Lazy Video Object Segmentation (ziVOS). In contrast to both of these tasks, which handle video object segmentation in an offline manner (i.e., on pre-recorded sequences), ziVOS targets online recorded sequences. We strive to strike a balance between performance and robustness in long-term scenarios by soliciting user feedback on-the-fly during the segmentation process. Our aim is to maximize the tracking duration of an object of interest while requiring minimal user corrections to maintain tracking over an extended period. We propose a competitive baseline, Lazy-XMem, as a reference for future work on ziVOS. Our approach uses an uncertainty estimate of the tracking state to determine whether a user interaction is necessary to refine the model's prediction. To quantitatively assess both the performance of our method and the user's workload, we introduce complementary metrics alongside those already established in the field. We evaluate our approach on the recently introduced LVOS dataset, which offers numerous long-term videos. Our code is publicly available at https://github.com/Vujas-Eteph/LazyXMem.
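The uncertainty-gated interaction described above can be illustrated with a minimal sketch. This is not the paper's actual implementation: the entropy-based criterion, the threshold value, and the function names are all illustrative assumptions, standing in for however Lazy-XMem quantifies the uncertainty of its tracking state.

```python
import numpy as np

def prediction_entropy(prob_map, eps=1e-8):
    """Mean per-pixel binary entropy of a soft segmentation mask in [0, 1].

    Higher values indicate a less confident prediction.
    """
    p = np.clip(prob_map, eps, 1.0 - eps)
    return float(np.mean(-p * np.log(p) - (1.0 - p) * np.log(1.0 - p)))

def needs_user_correction(prob_map, threshold=0.3):
    """Gate user interaction on prediction uncertainty.

    The threshold is a hypothetical value chosen for illustration; in practice
    it would trade off tracking robustness against user workload.
    """
    return prediction_entropy(prob_map) > threshold

# Usage: only solicit feedback when the tracker is unsure.
confident_mask = np.full((64, 64), 0.99)   # tracker is nearly certain
ambiguous_mask = np.full((64, 64), 0.5)    # tracker is maximally unsure

if needs_user_correction(ambiguous_mask):
    print("request user click to refine the mask")
```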