Current referring video object segmentation (R-VOS) techniques extract conditional kernels from encoded (low-resolution) vision-language features and use them to segment the decoded high-resolution features. We find that this causes significant feature drift, which the segmentation kernels struggle to perceive during the forward computation and which in turn degrades their segmentation ability. To address the drift problem, we propose a Spectrum-guided Multi-granularity (SgMg) approach, which performs direct segmentation on the encoded features and employs visual details to further optimize the masks. In addition, we propose Spectrum-guided Cross-modal Fusion (SCF) to perform intra-frame global interactions in the spectral domain for effective multimodal representation. Finally, we extend SgMg to perform multi-object R-VOS, a new paradigm that enables simultaneous segmentation of multiple referred objects in a video. This makes R-VOS not only faster but also more practical. Extensive experiments show that SgMg achieves state-of-the-art performance on four video benchmark datasets, outperforming the nearest competitor by 2.8 percentage points on Ref-YouTube-VOS. The extended SgMg enables multi-object R-VOS and runs about 3 times faster while maintaining satisfactory performance. Code is available at https://github.com/bo-miao/SgMg.
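To make the idea of intra-frame global interaction in the spectral domain more concrete, the following is a minimal NumPy sketch and not the SCF module itself. It assumes a toy design in which a sentence embedding sets the bandwidth of a per-channel Gaussian low-pass filter applied to the 2D spectrum of one frame's features. The function name `spectral_fusion`, the projection `w_proj`, the Gaussian filter shape, and all tensor sizes are hypothetical choices made for illustration.

```python
import numpy as np


def spectral_fusion(visual, text, w_proj):
    """Toy text-conditioned spectral filtering (illustrative assumption only).

    visual: (C, H, W) features of a single frame.
    text:   (D,) sentence embedding.
    w_proj: (D, C) hypothetical projection from text to per-channel bandwidths.
    """
    _, height, width = visual.shape

    # Map the text embedding to one filter bandwidth per channel, in (0.05, 0.5).
    sigma = 0.05 + 0.45 / (1.0 + np.exp(-(text @ w_proj)))  # (C,)

    # 2D FFT over the spatial axes. Every frequency bin depends on all pixels,
    # so a pointwise edit of the spectrum acts globally within the frame.
    spec = np.fft.rfft2(visual, axes=(-2, -1))  # (C, H, W // 2 + 1)

    # Squared radial frequency on the half-spectrum grid used by rfft2.
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.rfftfreq(width)[None, :]
    radius_sq = fy ** 2 + fx ** 2  # (H, W // 2 + 1)

    # Gaussian low-pass filter whose width is chosen by the text, per channel.
    filt = np.exp(-radius_sq[None] / (2.0 * sigma[:, None, None] ** 2))

    # Filter in the spectral domain and return to the spatial domain.
    return np.fft.irfft2(spec * filt, s=(height, width), axes=(-2, -1))


rng = np.random.default_rng(0)
visual = rng.standard_normal((4, 8, 10))   # C=4 channels, 8x10 feature map
text = rng.standard_normal(6)              # D=6 sentence embedding
w_proj = rng.standard_normal((6, 4))
fused = spectral_fusion(visual, text, w_proj)
print(fused.shape)
```

Because the filter equals 1 at zero frequency, this sketch preserves each channel's spatial mean while attenuating higher frequencies by a text-dependent amount. A learned complex-valued filter or a residual connection would be equally plausible choices; the abstract does not specify either.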