Recent video object segmentation (VOS) networks typically use memory-based methods: for each query frame, the mask is predicted by space-time matching against memory frames. Although these methods perform well, they suffer from two issues: 1) challenging data can destroy the space-time coherence between adjacent video frames; 2) pixel-level matching leads to undesired mismatches caused by noise or distractors. To address these issues, we first propose generating an auxiliary frame between adjacent frames, which serves as an implicit short-term temporal reference for the query frame. Next, we learn a prototype for each video object, so that prototype-level matching can be performed between the query and the memory. Experiments demonstrate that our network outperforms state-of-the-art methods on DAVIS 2017, achieving a J&F score of 86.4%, and attains a competitive 85.0% on YouTube VOS 2018. In addition, our network runs at a high inference speed of over 32 FPS.
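To make the idea of prototype-level matching concrete, the following is a minimal NumPy sketch, not the network described above. The abstract does not say how the prototypes are learned, so this sketch assumes masked average pooling over memory features as a stand-in, and cosine similarity as the matching score. The function names (`masked_average_prototypes`, `prototype_matching`, `assign_objects`) and array shapes are hypothetical and chosen only for illustration.

```python
import numpy as np

def masked_average_prototypes(mem_feats, mem_masks, eps=1e-6):
    """Assumption: one prototype per object via masked average pooling.

    mem_feats: (N, C) features of N memory pixels.
    mem_masks: (K, N) binary or soft masks for K objects.
    Returns a (K, C) array holding one prototype per object.
    """
    # Normalize each mask so that its weights sum to (almost exactly) one.
    weights = mem_masks / (mem_masks.sum(axis=1, keepdims=True) + eps)
    return weights @ mem_feats

def prototype_matching(query_feats, prototypes, eps=1e-6):
    """Cosine similarity between M query pixels and K prototypes, shape (M, K)."""
    q = query_feats / (np.linalg.norm(query_feats, axis=1, keepdims=True) + eps)
    p = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + eps)
    return q @ p.T

def assign_objects(query_feats, prototypes):
    """Label each query pixel with the index of its most similar prototype."""
    return prototype_matching(query_feats, prototypes).argmax(axis=1)

# Toy usage: four memory pixels, two objects, two feature channels.
mem_feats = np.array([[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]])
mem_masks = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])
prototypes = masked_average_prototypes(mem_feats, mem_masks)
query_feats = np.array([[1.0, 0.1], [0.05, 1.0]])
labels = assign_objects(query_feats, prototypes)
```

Each query pixel is compared with K object-level prototypes rather than with every memory pixel. A single noisy or distractor pixel in memory therefore has limited influence on the match, which is the motivation the abstract gives for moving from pixel-level to prototype-level matching.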