Spatial-temporal memory-based methods, which store intermediate frame segmentations as memory for long-range context modeling, have recently shown impressive results in semi-supervised video object segmentation (SVOS). However, these methods face two key limitations: 1) they rely on non-local pixel-level matching to read memory, which yields noisy retrieved features for segmentation; 2) they segment each object independently, without interaction between objects. These shortcomings cause memory-based methods to struggle when segmenting similar objects and multiple objects. To address these issues, we propose a query modulation method, termed QMVOS. This method summarizes object features into dynamic queries and then treats them as dynamic filters for mask prediction, thereby providing high-level descriptions and object-level perception for the model. Inter-query attention enables efficient and effective multi-object interactions. Extensive experiments demonstrate that our method brings significant improvements to the memory-based SVOS method and achieves competitive performance on standard SVOS benchmarks. The code is available at https://github.com/zht8506/QMVOS.
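The query modulation idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the linked repository for that). The function names, the use of masked average pooling to summarize object features, the unparameterized single-head attention, and the dot-product form of the dynamic filter are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def summarize_queries(feats, masks, eps=1e-6):
    """Assumed summarization step: masked average pooling.

    feats: (C, H, W) frame features; masks: (N, H, W) per-object masks.
    Returns one query vector per object, shape (N, C).
    """
    C = feats.shape[0]
    flat_feats = feats.reshape(C, -1)                # (C, HW)
    flat_masks = masks.reshape(masks.shape[0], -1)   # (N, HW)
    area = flat_masks.sum(axis=1, keepdims=True) + eps
    return flat_masks @ flat_feats.T / area

def inter_query_attention(queries):
    """Assumed interaction step: self-attention over the N object queries.

    No learned projections are used here; a residual connection keeps
    each object's own description.
    """
    d = queries.shape[1]
    attn = softmax(queries @ queries.T / np.sqrt(d), axis=-1)  # (N, N)
    return queries + attn @ queries

def predict_masks(queries, feats):
    """Apply each query as a 1x1 dynamic filter: logits of shape (N, H, W)."""
    return np.einsum("nc,chw->nhw", queries, feats)

# Toy example: 2 objects, 8 channels, a 4x4 feature map.
rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 4, 4))
masks = np.zeros((2, 4, 4))
masks[0, :2] = 1   # object 0 occupies the top half
masks[1, 2:] = 1   # object 1 occupies the bottom half

queries = inter_query_attention(summarize_queries(feats, masks))
logits = predict_masks(queries, feats)
print(logits.shape)  # (2, 4, 4)
```

In this sketch each object is described by a single vector rather than by per-pixel memory matches, and the attention step is the only place where objects exchange information.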