Multi-scale learning is central to semantic segmentation. We visualize the effective receptive field (ERF) of canonical multi-scale representations and identify two risks in learning them: scale inadequacy and field inactivation. To address these issues, we present a novel multi-scale learner, varying window attention (VWA). VWA builds on local window attention (LWA) and disentangles it into a query window and a context window, allowing the scale of the context to vary so that the query learns representations at multiple scales. However, enlarging the context window by a ratio R significantly increases the memory footprint and computation cost (R^2 times those of LWA). We propose a simple but professional re-scaling strategy that reduces this extra cost to zero without compromising performance. Consequently, VWA overcomes the receptive-field limitation of the local window at the same cost as LWA. Furthermore, building on VWA and employing various MLPs, we introduce a multi-scale decoder (MSD), VWFormer, to improve multi-scale representations for semantic segmentation. VWFormer achieves efficiency competitive with the most compute-friendly MSDs, such as FPN and the MLP decoder, while performing much better than any other MSD. For instance, using nearly half of UPerNet's computation, VWFormer outperforms it by 1.0%-2.5% mIoU on ADE20K. With little extra overhead (~10G FLOPs), Mask2Former equipped with VWFormer improves by 1.0%-1.3%. The code and models are available at https://github.com/yan-hao-tian/vw
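The R^2 cost claim in the abstract can be checked with a small back-of-the-envelope sketch. The code below only counts multiply-accumulates for the two attention matrix products inside one window; it is not the authors' implementation. The function names (`attention_cost`, `lwa_cost`, `vwa_cost`) are hypothetical, and the `rescale` branch assumes that re-scaling brings the enlarged context back to the same number of tokens as the query window, which the abstract does not spell out. The cost of the re-scaling operation itself is ignored.

```python
def attention_cost(num_query_tokens: int, num_context_tokens: int, dim: int) -> int:
    """Multiply-accumulates for Q @ K^T plus the attention-weighted sum over V."""
    return 2 * num_query_tokens * num_context_tokens * dim


def lwa_cost(window: int, dim: int) -> int:
    """Local window attention: query and context share one window x window patch."""
    tokens = window * window
    return attention_cost(tokens, tokens, dim)


def vwa_cost(window: int, ratio: int, dim: int, rescale: bool = False) -> int:
    """Varying window attention: the context side length is ratio * window.

    With rescale=True the context is ASSUMED to be reduced back to
    window x window tokens before attention (illustrative assumption only).
    """
    query_tokens = window * window
    context_side = window * ratio
    context_tokens = context_side * context_side
    if rescale:
        context_tokens = window * window
    return attention_cost(query_tokens, context_tokens, dim)


if __name__ == "__main__":
    window, ratio, dim = 8, 4, 64
    base = lwa_cost(window, dim)
    print("LWA cost:", base)
    print("VWA / LWA without re-scaling:", vwa_cost(window, ratio, dim) // base)
    print("VWA / LWA with re-scaling:", vwa_cost(window, ratio, dim, rescale=True) // base)
```

Under these assumptions, the ratio without re-scaling equals R^2 for any window size and channel dimension, because only the number of context tokens changes; with the context brought back to the query window's token count, the attention cost matches LWA.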