By contextualizing the kernel as globally as possible, modern ConvNets have shown great potential in computer vision tasks. However, recent progress on \textit{multi-order game-theoretic interaction} within deep neural networks (DNNs) reveals a representation bottleneck in modern ConvNets: expressive interactions are not effectively encoded as the kernel size increases. To tackle this challenge, we propose a new family of modern ConvNets, dubbed MogaNet, for discriminative visual representation learning in pure ConvNet-based models with favorable complexity-performance trade-offs. MogaNet encapsulates conceptually simple yet effective convolutions and gated aggregation into a compact module, in which discriminative features are efficiently gathered and adaptively contextualized. MogaNet exhibits great scalability, impressive parameter efficiency, and competitive performance compared to state-of-the-art ViTs and ConvNets on ImageNet and various downstream vision benchmarks, including COCO object detection, ADE20K semantic segmentation, 2D and 3D human pose estimation, and video prediction. Notably, MogaNet reaches 80.0\% and 87.8\% accuracy with 5.2M and 181M parameters on ImageNet-1K, outperforming ParC-Net and ConvNeXt-L while saving 59\% FLOPs and 17M parameters, respectively. The source code is available at \url{https://github.com/Westlake-AI/MogaNet}.