This paper proposes a Robust and Efficient Memory Network, referred to as REMN, for semi-supervised video object segmentation (VOS). Memory-based methods have recently achieved outstanding VOS performance by performing non-local pixel-wise matching between the query and memory. However, these methods have two limitations. 1) Non-local matching can cause distractor objects in the background to be incorrectly segmented. 2) Memory features with high temporal redundancy consume significant computing resources. For limitation 1, we introduce a local attention mechanism that tackles background distraction by enhancing the features of foreground objects with the previous mask. For limitation 2, we first adaptively decide whether to update the memory features depending on the variation of foreground objects, which reduces temporal redundancy. Second, we employ a dynamic memory bank, which uses a lightweight and differentiable soft modulation gate to decide how many memory features need to be removed in the temporal dimension. Experiments demonstrate that our REMN achieves state-of-the-art results on DAVIS 2017, with a $\mathcal{J\&F}$ score of 86.3%, and on YouTube-VOS 2018, with an overall mean $\mathcal{G}$ of 85.5%. Furthermore, our network runs at an inference speed of 25+ FPS and uses relatively few computing resources.
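The adaptive memory update described for limitation 2 can be illustrated with a minimal sketch. The abstract does not state how foreground variation is measured, so the code below assumes a simple stand-in: a frame is written to memory only when the IoU between its mask and the most recently stored mask falls below a threshold. The names `mask_iou`, `select_memory_frames`, and `iou_threshold` are hypothetical and are not taken from REMN.

```python
from typing import List, Sequence

Mask = Sequence[Sequence[int]]  # binary foreground mask, rows of 0/1


def mask_iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two binary masks of equal shape."""
    inter = union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += 1 if (pa and pb) else 0
            union += 1 if (pa or pb) else 0
    # Two empty masks are treated as identical (no foreground change).
    return 1.0 if union == 0 else inter / union


def select_memory_frames(masks: List[Mask], iou_threshold: float = 0.8) -> List[int]:
    """Return indices of frames whose masks would be written to memory.

    Assumption (not from the paper): a frame is stored only when its
    foreground differs enough from the most recently stored frame,
    i.e. when the IoU drops below `iou_threshold`.
    """
    if not masks:
        return []
    stored = [0]  # the first (annotated) frame is always kept
    for t in range(1, len(masks)):
        if mask_iou(masks[stored[-1]], masks[t]) < iou_threshold:
            stored.append(t)
    return stored


if __name__ == "__main__":
    m0 = [[1, 1], [0, 0]]
    m1 = [[1, 1], [0, 0]]  # unchanged foreground, skipped
    m2 = [[0, 1], [0, 1]]  # foreground moved, stored
    m3 = [[0, 1], [0, 1]]  # unchanged again, skipped
    print(select_memory_frames([m0, m1, m2, m3]))
```

Skipping near-identical frames in this way keeps the memory from filling with temporally redundant features; the threshold trades memory size against how quickly appearance changes are captured.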