State-of-the-art methods for Transformer-based semantic segmentation typically adopt Transformer decoders that extract additional embeddings from image embeddings via cross-attention, refine either or both types of embeddings via self-attention, and project image embeddings onto the additional embeddings via dot-product. Despite their remarkable success, these empirical designs still lack theoretical justification or interpretation, which hinders potentially principled improvements. In this paper, we argue that there are fundamental connections between semantic segmentation and compression, especially between Transformer decoders and Principal Component Analysis (PCA). From this perspective, we derive a white-box, fully attentional DEcoder for PrIncipled semantiC segmenTation (DEPICT), with the following interpretations: 1) the self-attention operator refines image embeddings to construct an ideal principal subspace that aligns with the supervision and retains most of the information; 2) the cross-attention operator seeks a low-rank approximation of the refined image embeddings, which is expected to form a set of orthonormal bases of the principal subspace that correspond to the predefined classes; 3) the dot-product operation yields a compact representation of the image embeddings that serves as the segmentation masks. Experiments on the ADE20K dataset show that DEPICT consistently outperforms its black-box counterpart, Segmenter, while being more lightweight and more robust.
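The PCA interpretation above can be illustrated with a toy numerical sketch (our own illustration, not the paper's implementation): treating the refined image embeddings as a data matrix, the orthonormal class bases that the cross-attention is interpreted to recover are played here by the top-K right singular vectors, and the dot-product step is a projection onto that principal subspace. All array names and sizes below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 64, 32, 8          # toy sizes: patches, embedding dim, classes
X = rng.normal(size=(N, D))  # stand-in for the refined image embeddings

# Low-rank approximation of X: the top-K right singular vectors give an
# orthonormal basis of the principal subspace (analogue of the class
# embeddings produced by cross-attention in the PCA reading).
_, _, Vt = np.linalg.svd(X, full_matrices=False)
Q = Vt[:K]                   # (K, D), rows are orthonormal

# Dot-product yields compact per-class scores for each patch -- the
# analogue of the segmentation masks.
masks = X @ Q.T              # (N, K)

# Sanity checks: Q is orthonormal, and masks @ Q is the best rank-K
# approximation of X by the Eckart-Young theorem.
assert np.allclose(Q @ Q.T, np.eye(K), atol=1e-8)
X_hat = masks @ Q            # rank-K reconstruction of X
print(masks.shape)           # (64, 8)
```

The point of the analogy is that projecting embeddings onto an orthonormal, class-aligned basis both compresses them (rank K instead of D) and directly produces the per-class mask logits.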