Video object segmentation (VOS) aims to distinguish and track target objects in a video. Despite the excellent performance achieved by off-the-shelf VOS models, existing VOS benchmarks mainly focus on short-term videos lasting about 5 seconds, in which objects remain visible most of the time. However, these benchmarks poorly represent practical applications, and the absence of long-term datasets restricts further investigation of VOS in realistic scenarios. Thus, we propose a novel benchmark named LVOS, comprising 720 videos with 296,401 frames and 407,945 high-quality annotations. Videos in LVOS last 1.14 minutes on average, approximately 5 times longer than videos in existing datasets. Each video includes various attributes, especially challenges arising in the wild, such as long-term reappearance and cross-temporal similar objects. Compared to previous benchmarks, our LVOS better reflects the performance of VOS models in real scenarios. Based on LVOS, we evaluate 20 existing VOS models under 4 different settings and conduct a comprehensive analysis. On LVOS, these models suffer a large performance drop, highlighting the challenge of achieving precise tracking and segmentation in real-world scenarios. Attribute-based analysis indicates that the key factor in the accuracy decline is the increased video length, emphasizing LVOS's crucial role. We hope our LVOS can advance the development of VOS in real scenes. Data and code are available at https://lingyihongfd.github.io/lvos.github.io/.