At the heart of foundation models is the philosophy of "more is different", exemplified by the astonishing success in computer vision and natural language processing. However, the challenges of optimization and the inherent complexity of transformer models call for a paradigm shift towards simplicity. In this study, we introduce VanillaNet, a neural network architecture that embraces elegance in design. By avoiding high depth, shortcuts, and intricate operations like self-attention, VanillaNet is refreshingly concise yet remarkably powerful. Each layer is carefully crafted to be compact and straightforward, with nonlinear activation functions pruned after training to restore the original architecture. VanillaNet overcomes the challenges of inherent complexity, making it ideal for resource-constrained environments. Its easy-to-understand and highly simplified architecture opens new possibilities for efficient deployment. Extensive experimentation demonstrates that VanillaNet delivers performance on par with renowned deep neural networks and vision transformers, showcasing the power of minimalism in deep learning. This visionary journey of VanillaNet has significant potential to redefine the landscape and challenge the status quo of foundation models, setting a new path for elegant and effective model design. Pre-trained models and code are available at https://github.com/huawei-noah/VanillaNet and https://gitee.com/mindspore/models/tree/master/research/cv/vanillanet.
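The claim that activations can be pruned after training to restore a shallower architecture can be illustrated with a minimal NumPy sketch. It assumes, purely for illustration, that the activation between two layers is blended toward the identity by a weight `lam`; once `lam` reaches 1, the two linear layers are algebraically a single layer. The names `blended_activation`, `two_layer_forward`, `merge_layers`, and `lam` are hypothetical and are not taken from the released code, and the layers are modeled as per-pixel linear maps (1x1 convolutions) rather than the actual VanillaNet blocks.

```python
import numpy as np

def blended_activation(x, lam):
    """Hypothetical blend: plain ReLU at lam=0, identity at lam=1."""
    return (1.0 - lam) * np.maximum(x, 0.0) + lam * x

def two_layer_forward(x, w1, b1, w2, b2, lam):
    """Two per-pixel linear layers with a blended activation in between."""
    h = blended_activation(x @ w1.T + b1, lam)
    return h @ w2.T + b2

def merge_layers(w1, b1, w2, b2):
    """Fold two linear layers with no nonlinearity between them into one.

    W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2)
    """
    return w2 @ w1, w2 @ b1 + b2

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                       # 5 pixels, 8 input channels
w1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)
w2, b2 = rng.normal(size=(4, 16)), rng.normal(size=4)

# With lam=1 the activation is the identity, so the merged layer
# reproduces the two-layer output.
w, b = merge_layers(w1, b1, w2, b2)
merged_out = x @ w.T + b
full_out = two_layer_forward(x, w1, b1, w2, b2, lam=1.0)
print(np.allclose(merged_out, full_out))
```

With `lam=0.0` the two-layer network is genuinely nonlinear and the merged layer no longer matches it, which is why the merge in this sketch is only valid after the activation has been fully removed.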