Traditional semantic segmentation requires a large number of labels and struggles to recognize categories unseen during training. Few-shot semantic segmentation (FSS) aims to segment objects of new classes using only a few labeled support images, which makes it highly practical in real-world settings. Previous research has been based primarily on prototypes or correlations. Because colors, textures, and styles are similar within the same image, we argue that the query image can be regarded as its own support image. In this paper, we propose the Target-aware Bi-Transformer Network (TBTNet), which treats the support images and the query image equivalently. We also design a powerful Target-aware Transformer Layer (TTL) to distill correlations and force the model to focus on foreground information. The TTL treats the hypercorrelation as a feature, resulting in a significant reduction in the number of feature channels. Thanks to this property, our model is the lightest to date, with only 0.4M learnable parameters. Furthermore, TBTNet converges in only 10% to 25% of the training epochs required by traditional methods. Its excellent performance on the standard FSS benchmarks PASCAL-5i and COCO-20i demonstrates the efficiency of our method. Extensive ablation studies were also carried out to evaluate the effectiveness of the Bi-Transformer architecture and the TTL.
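To make the notion of correlating a query image with a support image, and with itself, more concrete, the following is a minimal NumPy sketch of a dense cosine correlation between two feature maps. It is an illustration under stated assumptions, not the TBTNet implementation: the function name `cosine_correlation`, the tensor shapes, the random features, and the clamping of negative similarities to zero are all hypothetical choices made for this example. In a multi-layer setting, one such map per backbone layer could be stacked to form a hypercorrelation.

```python
import numpy as np

def cosine_correlation(query_feat, support_feat, support_mask=None, eps=1e-8):
    """Dense cosine correlation between two feature maps (illustrative only).

    query_feat:   array of shape (C, Hq, Wq)
    support_feat: array of shape (C, Hs, Ws)
    support_mask: optional binary foreground mask of shape (Hs, Ws)
    returns:      array of shape (Hq*Wq, Hs*Ws), negatives clamped to zero
    """
    c = query_feat.shape[0]
    if support_mask is not None:
        # Zero out background positions so only foreground features correlate.
        support_feat = support_feat * support_mask[None, :, :]
    q = query_feat.reshape(c, -1)
    s = support_feat.reshape(c, -1)
    # L2-normalize every spatial position along the channel axis.
    q = q / (np.linalg.norm(q, axis=0, keepdims=True) + eps)
    s = s / (np.linalg.norm(s, axis=0, keepdims=True) + eps)
    # Pairwise cosine similarity; negative values are clamped (an assumption).
    return np.maximum(q.T @ s, 0.0)

# Hypothetical features standing in for backbone outputs.
rng = np.random.default_rng(0)
query = rng.standard_normal((16, 8, 8))
support = rng.standard_normal((16, 8, 8))
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0

# Query against the masked support image.
cross_corr = cosine_correlation(query, support, mask)
# Query against itself, i.e. the query acting as its own support image.
self_corr = cosine_correlation(query, query)
```

Both calls return a map of the same shape, which is what allows the query-to-support and query-to-query branches to be processed in the same way in this sketch.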