Rapid advances in Vision Transformers (ViTs) have pushed the state of the art in various vision tasks, overshadowing conventional CNN-based models. This has ignited a recent line of "striking-back" research in the CNN world, showing that pure CNN models can match the performance of ViT models when carefully tuned. While encouraging, designing such high-performance CNN models is challenging and requires non-trivial prior knowledge of network design. To this end, a novel framework termed Mathematical Architecture Design for Deep CNN (DeepMAD) is proposed to design high-performance CNN models in a principled way. In DeepMAD, a CNN is modeled as an information processing system whose expressiveness and effectiveness can be analytically formulated in terms of its structural parameters. A constrained mathematical programming (MP) problem is then formulated to optimize these structural parameters. The MP problem can be easily solved by off-the-shelf MP solvers on CPUs with a small memory footprint. In addition, DeepMAD is a purely mathematical framework: no GPU or training data is required during network design. The superiority of DeepMAD is validated on multiple large-scale computer vision benchmark datasets. Notably, on ImageNet-1k, using only conventional convolutional layers, DeepMAD achieves 0.7% and 1.5% higher top-1 accuracy than ConvNeXt and Swin, respectively, at the Tiny level, and 0.8% and 0.9% higher at the Small level.
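
The idea of choosing structural parameters by solving a constrained optimization problem, with no GPU or training data, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not DeepMAD's actual formulation: the score `depth * log(width)`, the cost model `depth * width**2`, the depth-to-width ratio constraint, and the function name `design_network` are all hypothetical, and a plain exhaustive search stands in for an off-the-shelf MP solver.

```python
import math


def design_network(param_budget, max_depth_width_ratio,
                   max_depth=64, max_width=1024):
    """Search (depth, width) pairs for the best toy score under constraints.

    Hypothetical toy model: a stack of `depth` square layers of size `width`.
    Returns (score, depth, width), or None if no configuration is feasible.
    """
    best = None
    for depth in range(1, max_depth + 1):
        for width in range(8, max_width + 1, 8):
            # Toy cost model (assumption): each layer has width * width weights.
            params = depth * width * width
            if params > param_budget:
                break  # wider layers at this depth only cost more
            # Toy shape constraint (assumption): avoid overly deep, narrow nets.
            if depth / width > max_depth_width_ratio:
                continue
            # Toy expressiveness proxy (assumption), to be maximized.
            score = depth * math.log(width)
            if best is None or score > best[0]:
                best = (score, depth, width)
    return best


if __name__ == "__main__":
    result = design_network(param_budget=1_000_000, max_depth_width_ratio=0.1)
    print(result)
```

The search touches only integers and logarithms, so it runs on a CPU in well under a second. A real MP solver would replace the nested loops when the number of structural parameters grows beyond what enumeration can handle.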