Contextual information plays a core role in video semantic segmentation (VSS). This paper summarizes contexts for VSS from two perspectives: local temporal contexts (LTC), defined over neighboring frames, and global temporal contexts (GTC), which represent contexts from the whole video. LTC comprises static and motional contexts, corresponding to static and moving content in neighboring frames, respectively. Both static and motional contexts have been studied individually, but no prior work learns them jointly, despite their being highly complementary. Hence, we propose a Coarse-to-Fine Feature Mining (CFFM) technique to learn a unified representation of LTC. CFFM contains two parts: Coarse-to-Fine Feature Assembling (CFFA) and Cross-frame Feature Mining (CFM). CFFA abstracts static and motional contexts, while CFM mines useful information from nearby frames to enhance target features. To exploit further temporal contexts, we propose CFFM++, which additionally learns GTC from the whole video. Specifically, we uniformly sample frames from the video and extract global contextual prototypes by k-means; the information within those prototypes is then mined by CFM to refine target features. Experimental results on popular benchmarks demonstrate that CFFM and CFFM++ perform favorably against state-of-the-art methods. Our code is available at https://github.com/GuoleiSun/VSS-CFFM
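The GTC step of CFFM++ (uniformly sample frames, then cluster their features into global contextual prototypes with k-means) can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: the function name, feature shapes, and the plain Lloyd's-iteration k-means are all assumptions, and the subsequent CFM-based refinement is omitted.

```python
import numpy as np

def extract_global_prototypes(frame_features, num_prototypes=8, iters=20, seed=0):
    """Cluster per-pixel features from uniformly sampled frames into
    global contextual prototypes (one prototype per cluster center).

    frame_features: list of (H*W, C) arrays, one per sampled frame.
    Returns: (num_prototypes, C) array of prototype vectors.
    """
    feats = np.concatenate(frame_features, axis=0)  # (N, C), pooled over frames
    rng = np.random.default_rng(seed)
    # Initialize prototypes from randomly chosen feature vectors.
    centers = feats[rng.choice(len(feats), num_prototypes, replace=False)].copy()
    for _ in range(iters):
        # Assign each feature vector to its nearest prototype.
        dists = np.linalg.norm(feats[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Update each prototype to the mean of its assigned features.
        for k in range(num_prototypes):
            members = feats[labels == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return centers
```

In the full method, these prototypes would serve as the key/value set that CFM attends over when refining the target frame's features.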