Aggregating multi-stage features is known to play a significant role in semantic segmentation. Unlike previous methods that aggregate features by point-wise summation or concatenation, this study proposes the Category Feature Transformer (CFT), which models the flow of category embedding and transformation among multi-stage features using the multi-head attention mechanism. During each aggregation step, CFT learns a unified feature embedding for each semantic category from the high-level features and dynamically broadcasts these embeddings to the high-resolution features. Integrated into a typical feature pyramid structure, CFT delivers superior performance across a broad range of backbone networks. We conduct extensive experiments on popular semantic segmentation benchmarks. In particular, CFT achieves 55.1% mIoU on the challenging ADE20K dataset with greatly reduced model parameters and computation.
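The two-step aggregation described above (category embedding from high-level features, then broadcasting to high-resolution features) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the use of learnable-style category queries, the residual addition, the omission of learned projection layers, and all tensor sizes are assumptions made only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(query, key, value, num_heads):
    """Scaled dot-product attention split over heads.

    Learned input/output projections are omitted to keep the sketch short.
    query: (n_q, dim), key and value: (n_k, dim); dim must divide by num_heads.
    """
    n_q, dim = query.shape
    n_k = key.shape[0]
    head_dim = dim // num_heads
    # Reshape (tokens, dim) -> (heads, tokens, head_dim).
    q = query.reshape(n_q, num_heads, head_dim).transpose(1, 0, 2)
    k = key.reshape(n_k, num_heads, head_dim).transpose(1, 0, 2)
    v = value.reshape(n_k, num_heads, head_dim).transpose(1, 0, 2)
    # Attention weights have shape (heads, n_q, n_k).
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(head_dim))
    out = weights @ v  # (heads, n_q, head_dim)
    # Merge heads back into (n_q, dim).
    return out.transpose(1, 0, 2).reshape(n_q, dim)


def category_feature_aggregation(high_level, high_res, category_queries, num_heads=4):
    """Hypothetical two-step aggregation (an assumption, not the paper's code)."""
    # Step 1 (category embedding): one query per semantic category attends
    # to the high-level features and yields one embedding per category.
    category_embed = multi_head_attention(
        category_queries, high_level, high_level, num_heads
    )
    # Step 2 (transformation): every high-resolution position attends to the
    # category embeddings, which broadcasts category information to it.
    broadcast = multi_head_attention(
        high_res, category_embed, category_embed, num_heads
    )
    # Residual addition is an assumed fusion choice.
    return high_res + broadcast, category_embed


rng = np.random.default_rng(0)
num_categories, dim = 150, 64  # illustrative sizes only
high_level = rng.standard_normal((16 * 16, dim))  # flattened low-resolution map
high_res = rng.standard_normal((32 * 32, dim))    # flattened high-resolution map
category_queries = rng.standard_normal((num_categories, dim))

fused, category_embed = category_feature_aggregation(
    high_level, high_res, category_queries
)
print(fused.shape, category_embed.shape)
```

In this sketch the cost of the second step scales with the number of categories rather than with the number of high-level positions, which is one plausible reading of why attending to category embeddings could be cheaper than dense cross-attention between feature maps.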