Unsupervised video object segmentation (VOS) aims to detect the most salient object in a video without external guidance about the object. Because salient objects usually move distinctively relative to the background, recent methods jointly use motion cues extracted from optical flow maps and appearance cues extracted from RGB images. However, because optical flow maps are usually highly correlated with segmentation masks, the network easily learns to depend too heavily on motion cues during training. As a result, such two-stream approaches are vulnerable to confusing motion cues, which makes their predictions unstable. To relieve this issue, we design a novel motion-as-option network that treats motion cues as optional. During network training, RGB images are randomly fed to the motion encoder in place of optical flow maps, which implicitly reduces the network's dependency on motion. Since the learned motion encoder can handle both RGB images and optical flow maps, two different predictions can be generated depending on which source is used as the motion input. To fully exploit this property, we also propose an adaptive output selection algorithm that adopts the optimal prediction at test time. Our proposed approach achieves state-of-the-art performance on all public benchmark datasets while maintaining real-time inference speed.
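The random input substitution used during training can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `choose_motion_input`, the parameter `rgb_prob`, and its default of 0.5 are assumptions made for illustration, since the abstract does not state the substitution probability. The frame and flow arguments are treated as opaque objects, so any tensor type could be passed.

```python
import random


def choose_motion_input(rgb_frame, flow_map, rgb_prob=0.5, rng=random):
    """Pick what the motion encoder receives for one training sample.

    With probability `rgb_prob` (a hypothetical value, not taken from the
    paper) the RGB frame is returned in place of the optical flow map, so
    the motion encoder sees both kinds of input over the course of training.
    Returns the chosen input and a label naming its source.
    """
    if rng.random() < rgb_prob:
        return rgb_frame, "rgb"
    return flow_map, "flow"


# Usage: count how often each source is chosen over many samples.
rng = random.Random(0)  # seeded generator for repeatability
counts = {"rgb": 0, "flow": 0}
for _ in range(1000):
    _, source = choose_motion_input("rgb_frame", "flow_map", rgb_prob=0.5, rng=rng)
    counts[source] += 1
```

At test time the same encoder can be run once with each source to obtain the two predictions the abstract mentions. The adaptive rule that selects between them is not described in the abstract, so it is not sketched here.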