Transformers have recently shown encouraging progress in computer vision. In this work, we present new baselines that improve the original Pyramid Vision Transformer (PVT v1) with three designs: (1) a linear-complexity attention layer, (2) overlapping patch embedding, and (3) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linear and achieves significant improvements on fundamental vision tasks such as classification, detection, and segmentation. Notably, PVT v2 achieves comparable or better performance than recent works such as Swin Transformer. We hope this work will facilitate state-of-the-art Transformer research in computer vision. Code is available at https://github.com/whai362/PVT.
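To make the "linear complexity" claim concrete, the following is a minimal counting sketch, not the authors' implementation. It assumes that linear cost comes from pooling keys and values to a fixed grid before attention; the abstract does not state the mechanism, and the function name `attention_pairs` and the pooled size of 7 are illustrative choices. The sketch counts only the query-key pairs that one attention layer scores on a square token grid.

```python
def attention_pairs(height, width, pooled_size=None):
    """Count query-key pairs scored by one attention layer on a height x width token grid.

    pooled_size=None models full self-attention (every token attends to every token).
    pooled_size=P models keys/values pooled to a fixed P x P grid (an assumption here).
    """
    num_queries = height * width
    if pooled_size is None:
        num_keys = num_queries  # keys grow with the input: quadratic cost
    else:
        num_keys = pooled_size * pooled_size  # keys stay fixed: linear cost
    return num_queries * num_keys


if __name__ == "__main__":
    for side in (14, 28, 56):
        full = attention_pairs(side, side)
        pooled = attention_pairs(side, side, pooled_size=7)
        print(f"{side}x{side} tokens: full={full}, pooled={pooled}")
```

Under this assumption, doubling the side length quadruples the number of tokens; the full-attention count then grows 16-fold, while the pooled count grows only 4-fold, in step with the token count.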