Recent advances in video semantic segmentation have achieved substantial progress by exploiting temporal correlations. Nevertheless, persistent challenges, including redundant computation and the reliability of the feature propagation process, underscore the need for further innovation. In response, we present Deep Common Feature Mining (DCFM), a novel approach designed to address these challenges by leveraging the concept of feature sharing. DCFM explicitly decomposes features into two complementary components. The common representation, extracted from a key frame, furnishes essential high-level information to neighboring non-key frames, allowing direct re-utilization without feature propagation. Simultaneously, the independent feature, derived from each individual frame, captures rapidly changing information and provides frame-specific cues crucial for segmentation. To achieve this decomposition, we employ a symmetric training strategy tailored to sparsely annotated data, enabling the backbone to learn a robust high-level representation enriched with common information. Additionally, we incorporate a self-supervised loss function to reinforce intra-class feature similarity and enhance temporal consistency. Experimental evaluations on the VSPW and Cityscapes datasets demonstrate the effectiveness of our method, showing a superior balance between accuracy and efficiency.
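The key-frame reuse scheme described above can be illustrated with a minimal sketch. All function names and shapes here are hypothetical stand-ins, not the authors' implementation: a heavy backbone extracts the common representation once per key frame, a lightweight per-frame branch produces the independent feature, and non-key frames reuse the common feature directly, with no propagation step such as flow-based warping.

```python
import numpy as np

rng = np.random.default_rng(0)

def heavy_backbone(frame):
    # Stand-in for the deep backbone that mines the high-level
    # "common" representation from a key frame (hypothetical).
    return frame.mean(axis=-1, keepdims=True) * np.ones((1, 1, 8))

def light_branch(frame):
    # Stand-in for a shallow per-frame branch yielding the
    # "independent" feature with rapidly changing, frame-specific cues.
    return frame[..., :8] * 0.1

def segment(common, independent):
    # Fuse the shared and frame-specific components; a real head
    # would predict per-pixel class logits from the fused feature.
    return common + independent

frames = [rng.random((4, 4, 16)) for _ in range(3)]

# Extract the common feature ONCE from the key frame (frame 0)...
common = heavy_backbone(frames[0])

# ...and reuse it directly for every non-key frame: no feature
# propagation is needed, which avoids redundant computation.
outputs = [segment(common, light_branch(f)) for f in frames]
```

The efficiency gain in this sketch comes from calling `heavy_backbone` once per key frame while each non-key frame only pays for the cheap `light_branch`.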