Transformers have become popular in recent crowd counting work because they overcome the limited receptive field of traditional CNNs. However, since crowd images always contain a large number of similar patches, the self-attention mechanism in a Transformer tends to find a homogenized solution in which the attention maps of almost all patches are identical. In this paper, we address this problem by proposing Gramformer: a graph-modulated transformer that enhances the network by adjusting the attention and the input node features, respectively, on the basis of two different types of graphs. First, an attention graph is proposed to diversify attention maps so that they attend to complementary information. The graph is built upon the dissimilarities between patches and modulates the attention in an anti-similarity fashion. Second, a feature-based centrality encoding is proposed to discover the centrality positions or importance of nodes. We encode them with a proposed centrality indices scheme to modulate the node features and similarity relationships. Extensive experiments on four challenging crowd counting datasets have validated the competitiveness of the proposed method. Code is available at https://github.com/LoraLinH/Gramformer.
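
The abstract gives no formulas, so the following is only a minimal sketch of the two ideas under stated assumptions, not the authors' implementation. It assumes cosine similarity between patch features, a "keep the least similar patches" rule for the attention graph, and thresholded degree centrality quantized into a few levels. All function names and parameters (`anti_similarity_mask`, `keep_ratio`, `centrality_indices`, `threshold`, `num_levels`) are hypothetical and do not come from the Gramformer repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def softmax(row):
    """Numerically stable softmax; -inf entries receive weight 0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def anti_similarity_mask(feats, keep_ratio=0.5):
    """Assumed attention graph: each patch keeps edges to itself and to
    its least similar patches (1 = edge kept, 0 = edge dropped)."""
    n = len(feats)
    k = max(1, int(n * keep_ratio))
    mask = []
    for i in range(n):
        sims = [cosine(feats[i], feats[j]) for j in range(n)]
        least_similar = sorted(range(n), key=lambda j: sims[j])[:k]
        keep = set(least_similar) | {i}
        mask.append([1 if j in keep else 0 for j in range(n)])
    return mask


def modulated_attention(feats, mask):
    """Scaled dot-product attention restricted to edges of the graph."""
    d = len(feats[0])
    attn = []
    for i, q in enumerate(feats):
        scores = []
        for j, key in enumerate(feats):
            if mask[i][j]:
                scores.append(sum(a * b for a, b in zip(q, key)) / math.sqrt(d))
            else:
                scores.append(float("-inf"))  # dropped edge
        attn.append(softmax(scores))
    return attn


def centrality_indices(feats, threshold=0.9, num_levels=4):
    """Assumed centrality scheme: degree in a thresholded similarity
    graph, quantized into integer levels 0 .. num_levels - 1."""
    n = len(feats)
    indices = []
    for i in range(n):
        degree = sum(
            1 for j in range(n)
            if j != i and cosine(feats[i], feats[j]) > threshold
        )
        indices.append(degree * num_levels // n)
    return indices


# Toy input: three near-identical "crowd" patches and one distinct patch.
patches = [[1.0, 0.0], [1.0, 0.01], [0.99, 0.0], [0.0, 1.0]]
mask = anti_similarity_mask(patches)
attn = modulated_attention(patches, mask)
levels = centrality_indices(patches)
```

In this toy input, the three near-duplicate patches are forced to attend to the distinct patch rather than only to each other, and they receive a higher centrality level than the distinct patch. In a full model, the centrality level would presumably index a learned embedding added to the node features; that step is omitted here because the abstract does not specify it.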