To alleviate the cost of expensive human labeling, semi-supervised semantic segmentation uses a few labeled images and abundant unlabeled images to predict a pixel-level label map of the same size as the input. Previous methods often adopt co-training with two convolutional networks that share the same architecture but differ in initialization, which fails to capture sufficiently diverse features. This motivates us to use tri-training and develop a triple-view encoder, which employs encoders with different architectures to derive diverse features and exploits knowledge distillation to learn the complementary semantics among these encoders. Moreover, existing methods simply concatenate the features from both the encoder and the decoder, resulting in redundant features that incur a large memory cost. This inspires us to devise a dual-frequency decoder that selects the important features by projecting them from the spatial domain to the frequency domain, where a dual-frequency channel attention mechanism is introduced to model feature importance. We therefore propose a Triple-view Knowledge Distillation framework, termed TriKD, for semi-supervised semantic segmentation, comprising the triple-view encoder and the dual-frequency decoder. Extensive experiments on two benchmarks, i.e., Pascal VOC 2012 and Cityscapes, verify the superiority of the proposed method, which achieves a good tradeoff between precision and inference speed.
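
To make the distillation idea concrete, the following minimal sketch computes a generic pixel-wise knowledge distillation loss between the outputs of two encoders. It is the standard temperature-softened KL divergence, shown here only as an illustration; the abstract does not specify the exact objective used in TriKD, and the function names, the list-based data layout, and the default temperature are all assumptions.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one pixel's class logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_norm = math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - peak - log_norm for z in scaled]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Mean per-pixel KL(teacher || student) on temperature-softened outputs.

    Hypothetical helper: both arguments are lists of pixels, and each pixel
    is a list of class logits. The temperature**2 factor is the common
    scaling convention for soft-target losses, assumed here.
    """
    total = 0.0
    for t_pix, s_pix in zip(teacher_logits, student_logits):
        log_p = log_softmax(t_pix, temperature)  # teacher distribution
        log_q = log_softmax(s_pix, temperature)  # student distribution
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature ** 2 * total / len(teacher_logits)


# Two pixels, three classes: one encoder's logits act as the teacher signal.
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
student = [[-1.0, 0.5, 2.0], [0.3, 0.2, 0.1]]
loss = distillation_loss(teacher, student)
```

The loss is zero when the two encoders agree exactly and grows as their softened predictions diverge. In a tri-training setting, a term of this kind could be applied between pairs of encoders so that each learns from the semantics captured by the others.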