The multi-level aggregation (MLA) module has emerged as a critical component for advancing new-era vision backbones in semantic segmentation. In this paper, we propose Lawin (large window) Transformer, a novel MLA architecture that makes use of the multi-scale feature maps produced by the vision backbone. At the core of Lawin Transformer is Lawin attention, a newly designed window attention mechanism in which each local query window attends to a much larger context window. We focus on an efficient and simple application of the large-window paradigm, which allows the ratio of context size to query size to be adjusted flexibly and multi-scale representations to be captured. We validate Lawin Transformer on Cityscapes and ADE20K, where it consistently outperforms widely used MLA modules when combined with new-era vision backbones. The code is available at https://github.com/yan-hao-tian/lawin.
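The large-window idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the linked repository for that). The function name `large_window_attention` and its parameters `patch` and `ratio` are hypothetical. The sketch uses a single head with no learned projections, and it assumes that the context window is zero-padded at image borders and average-pooled back to the query window's size so that the cost per window stays fixed as `ratio` grows.

```python
import numpy as np

def large_window_attention(feat, patch=4, ratio=2):
    """Illustrative large-window attention (assumed design, not the official code).

    feat:  (H, W, C) feature map, with H and W divisible by `patch`.
    patch: side length of each local query window.
    ratio: context side length divided by query side length.
    """
    feat = np.asarray(feat, dtype=np.float64)
    H, W, C = feat.shape
    assert H % patch == 0 and W % patch == 0

    ctx = ratio * patch                     # side length of the context window
    before = (ctx - patch) // 2             # padding so the context is centred
    after = ctx - patch - before
    padded = np.pad(feat, ((before, after), (before, after), (0, 0)))

    out = np.empty_like(feat)
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            # Queries: the local patch x patch window.
            q = feat[i:i + patch, j:j + patch].reshape(-1, C)
            # Context: the ctx x ctx window centred on the query window.
            window = padded[i:i + ctx, j:j + ctx]
            # Assumption: average-pool ratio x ratio blocks down to patch x patch.
            pooled = window.reshape(patch, ratio, patch, ratio, C).mean(axis=(1, 3))
            k = v = pooled.reshape(-1, C)
            # Scaled dot-product attention with a numerically stable softmax.
            scores = q @ k.T / np.sqrt(C)
            scores -= scores.max(axis=1, keepdims=True)
            weights = np.exp(scores)
            weights /= weights.sum(axis=1, keepdims=True)
            out[i:i + patch, j:j + patch] = (weights @ v).reshape(patch, patch, C)
    return out

# Usage: the same feature map queried with different context-to-query ratios.
x = np.random.default_rng(0).normal(size=(8, 8, 16))
multi_scale = [large_window_attention(x, patch=4, ratio=r) for r in (1, 2, 4)]
```

Running the function with several values of `ratio` and combining the outputs is one way to obtain multi-scale representations from a single feature map under these assumptions.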