The multi-level aggregation (MLA) module has emerged as a critical component for advancing new-era vision backbones in semantic segmentation. In this paper, we propose Lawin (large window) Transformer, a novel MLA architecture that makes creative use of the multi-scale feature maps produced by the vision backbone. At the core of Lawin Transformer is Lawin attention, a newly designed window attention mechanism in which each local query window can attend to a much larger context window. We focus on an efficient and simple application of the large-window paradigm, one that allows the ratio of context size to query size to be regulated flexibly and thereby captures multi-scale representations. We validate Lawin Transformer on Cityscapes and ADE20K, where it consistently outperforms widely used MLA modules when combined with new-era vision backbones. The code is available at https://github.com/yan-hao-tian/lawin.
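
To make the large-window idea concrete, the following is a minimal NumPy sketch of one way a local query window could attend to a larger surrounding context window. It is an illustration under stated assumptions, not the released implementation: the function name `large_window_attention`, the parameters `patch` and `ratio`, the zero padding at image borders, the average pooling of the context down to the query size, and the absence of learned projections and multiple heads are all simplifications introduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def large_window_attention(feat, patch=4, ratio=2):
    """Hypothetical sketch of window attention with an enlarged context.

    feat  : (H, W, C) feature map; H and W must be multiples of `patch`.
    patch : side length of each local query window.
    ratio : context side length divided by query side length.
    """
    feat = np.asarray(feat, dtype=float)
    H, W, C = feat.shape
    assert H % patch == 0 and W % patch == 0

    ctx = patch * ratio            # side length of the context window
    before = (ctx - patch) // 2    # context margin above / left of the query
    after = ctx - patch - before   # context margin below / right of the query

    # Assumption: zero-pad so that border windows also get a full context.
    padded = np.pad(feat, ((before, after), (before, after), (0, 0)))

    out = np.empty_like(feat)
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            # Query tokens: the local window, flattened to (patch*patch, C).
            q = feat[i:i + patch, j:j + patch].reshape(-1, C)

            # Context: the larger window around the same location.
            c = padded[i:i + ctx, j:j + ctx]

            # Assumption: average-pool the context by `ratio` so the number
            # of key/value tokens stays equal to the number of queries.
            c = c.reshape(patch, ratio, patch, ratio, C).mean(axis=(1, 3))
            kv = c.reshape(-1, C)

            # Single-head scaled dot-product attention, identity projections.
            attn = softmax(q @ kv.T / np.sqrt(C))
            out[i:i + patch, j:j + patch] = (attn @ kv).reshape(patch, patch, C)
    return out
```

With `ratio=1` the sketch reduces to ordinary local window attention. Calling it with several values of `ratio` and combining the outputs is one plausible way to obtain multi-scale representations, although how the actual architecture combines them is not specified in this abstract.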