We present Cutie, a video object segmentation (VOS) network with object-level memory reading, which puts the object representation from memory back into the video object segmentation result. Recent works on VOS employ bottom-up pixel-level memory reading, which struggles with matching noise, especially in the presence of distractors, and therefore performs worse on more challenging data. In contrast, Cutie performs top-down object-level memory reading: it adapts a small set of object queries that restructure and iteratively interact with the bottom-up pixel features through a query-based object transformer (qt, hence Cutie). The object queries act as a high-level summary of the target object, while high-resolution feature maps are retained for accurate segmentation. Together with foreground-background masked attention, Cutie cleanly separates the semantics of the foreground object from the background. On the challenging MOSE dataset, Cutie improves by 8.7 J&F over XMem with a similar running time and improves by 4.2 J&F over DeAOT while running three times as fast. Code is available at: https://hkchengrex.github.io/Cutie
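To make the idea of foreground-background masked attention concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single attention head with no learned projections, and it assumes that the first half of the object queries may attend only to foreground pixels and the second half only to background pixels. The function name `masked_cross_attention` and all parameter names are hypothetical.

```python
import math


def masked_cross_attention(queries, pixels, fg_mask):
    """Illustrative foreground-background masked cross-attention.

    queries: list of N query vectors (each a list of D floats)
    pixels:  list of P pixel feature vectors (each a list of D floats)
    fg_mask: list of P booleans, True where the pixel is foreground

    Assumption (for illustration only): the first half of the queries
    attends to foreground pixels, the second half to background pixels.
    """
    n, d = len(queries), len(queries[0])
    updated = []
    for i, q in enumerate(queries):
        wants_fg = i < n // 2
        allowed = [j for j, is_fg in enumerate(fg_mask) if is_fg == wants_fg]
        if not allowed:
            # No pixel of the required type: leave the query unchanged.
            updated.append(list(q))
            continue
        # Scaled dot-product scores, computed over allowed pixels only.
        scores = [
            sum(a * b for a, b in zip(q, pixels[j])) / math.sqrt(d)
            for j in allowed
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Attention-weighted average of the allowed pixel features.
        readout = [
            sum(w * pixels[j][k] for w, j in zip(weights, allowed)) / total
            for k in range(d)
        ]
        # Residual update: each query absorbs its masked readout.
        updated.append([qk + rk for qk, rk in zip(q, readout)])
    return updated


# Toy example: two foreground pixels, one background pixel, two queries.
pixels = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
fg_mask = [True, True, False]
queries = [[0.0, 0.0], [0.0, 0.0]]
out = masked_cross_attention(queries, pixels, fg_mask)
```

In this toy case the first query summarizes only the foreground features and the second only the background feature, which is the separation the masking is meant to encourage. A real model would add learned projections, multiple heads, and several rounds of interaction between queries and pixel features.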