Recently, some large-kernel ConvNets have struck back with appealing performance and efficiency. However, given the quadratic complexity of convolution, scaling up kernels brings an enormous number of parameters, and the proliferation of parameters can induce severe optimization problems. Due to these issues, current CNNs compromise by scaling up only to 51x51 in the form of stripe convolution (i.e., 51x5 + 5x51) and begin to saturate as the kernel size continues to grow. In this paper, we delve into addressing these vital issues and explore whether we can continue scaling up kernels for further performance gains. Inspired by human vision, we propose a human-like peripheral convolution that efficiently reduces over 90% of the parameter count of dense grid convolution through parameter sharing, and manages to scale up the kernel size to an extremely large one. Our peripheral convolution behaves highly similarly to human vision, reducing the parameter complexity of convolution from O(K^2) to O(logK) without degrading performance. Built on this, we propose the Parameter-efficient Large Kernel Network (PeLK). Our PeLK outperforms modern vision Transformers and ConvNet architectures such as Swin, ConvNeXt, RepLKNet and SLaK on various vision tasks including ImageNet classification, semantic segmentation on ADE20K and object detection on MS COCO. For the first time, we successfully scale up the kernel size of CNNs to an unprecedented 101x101 and demonstrate consistent improvements.
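The parameter-sharing idea behind peripheral convolution can be illustrated with a toy calculation. Below is a minimal sketch, not the paper's actual implementation: it assumes a fine-grained central region of half-width 2 and peripheral bands whose width doubles with distance from the center (the exact partition used in PeLK may differ). With this scheme the number of distinct parameters per axis grows logarithmically in the kernel size K, rather than linearly as in a dense kernel.

```python
import math

CENTER = 2  # half-width of the fine-grained central region (assumed value)

def zone(d: int) -> int:
    """Map a 1-D offset from the kernel center to a parameter-sharing zone.

    Central offsets (|d| < CENTER) each keep a unique parameter; peripheral
    offsets share parameters within exponentially widening bands, so the
    number of distinct zones per axis grows logarithmically with kernel size.
    """
    if abs(d) < CENTER:
        return d  # fine-grained: one zone per central position
    band = int(math.log2(abs(d) / CENTER)) + 1  # band widths: 2, 4, 8, ...
    return (CENTER - 1 + band) * (1 if d > 0 else -1)

def param_count(k: int) -> int:
    """Distinct shared parameters needed to fill a k x k kernel."""
    r = k // 2
    zones = {(zone(dy), zone(dx))
             for dy in range(-r, r + 1)
             for dx in range(-r, r + 1)}
    return len(zones)

if __name__ == "__main__":
    for k in (51, 101):
        dense, shared = k * k, param_count(k)
        print(f"{k}x{k}: dense={dense}, shared={shared}, "
              f"reduction={100 * (1 - shared / dense):.1f}%")
```

Under these assumptions, a 51x51 kernel needs 121 shared parameters instead of 2601 dense ones, and a 101x101 kernel needs 169 instead of 10201: a reduction well above the 90% the abstract cites, consistent with per-axis O(logK) growth.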