Large-kernel convolutional neural networks (ConvNets) have recently received extensive research attention, but two unresolved and critical issues demand further investigation. 1) The architectures of existing large-kernel ConvNets largely follow the design principles of conventional ConvNets or transformers, while the architectural design for large-kernel ConvNets remains under-addressed. 2) As transformers have come to dominate multiple modalities, it remains to be investigated whether ConvNets also have strong universal perception ability in domains beyond vision. In this paper, we contribute from two aspects. 1) We propose four architectural guidelines for designing large-kernel ConvNets, the core of which is to exploit the essential characteristic of large kernels that distinguishes them from small kernels: they can see wide without going deep. Following such guidelines, our proposed large-kernel ConvNet shows leading performance on visual recognition tasks (ImageNet accuracy of 88.0%, ADE20K mIoU of 55.6%, and COCO box AP of 56.4%), demonstrating better performance and higher speed than recent powerful competitors. 2) We discover that large kernels are the key to unlocking the exceptional performance of ConvNets in domains where they were originally not proficient. With certain modality-related preprocessing approaches, the proposed model achieves state-of-the-art performance on time-series forecasting and audio recognition tasks even without modality-specific customization of the architecture. All code and models are publicly available on GitHub and Huggingface.