Dynamic networks, e.g., Dynamic Convolution (DY-Conv) and Mixture of Experts (MoE), have been extensively explored because they can considerably improve a model's representation power at acceptable computational cost. The common practice in implementing dynamic networks is to convert given static layers into fully dynamic ones, in which all parameters (at least within a single layer) are dynamic and vary with the input. However, such a fully dynamic setting may introduce redundant parameters and high deployment costs, limiting the applicability of dynamic networks to a broader range of tasks and models. The main contributions of our work are to challenge this common assumption in dynamic networks and to propose a partially dynamic network, namely PAD-Net, that transforms the redundant dynamic parameters into static ones. We further design Iterative Mode Partition to partition dynamic and static parameters efficiently. Our method is comprehensively supported by large-scale experiments with two typical advanced dynamic architectures, i.e., DY-Conv and MoE, on both image classification and GLUE benchmarks. Encouragingly, we surpass the fully dynamic networks by $+0.7\%$ top-1 accuracy with only $30\%$ dynamic parameters for ResNet-50 and by $+1.9\%$ average score in language understanding with only $50\%$ dynamic parameters for BERT. Code will be released at: \url{https://github.com/Shwai-He/PAD-Net}.
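To make the idea of a partially dynamic layer concrete, the following is a minimal sketch, not the released PAD-Net code. It assumes a binary mask that marks each weight entry as dynamic or static, a softmax-gated mixture of expert weights for the dynamic part, and a one-shot top-fraction selection over arbitrary importance scores. The function names, the gating form, and the scoring rule are all hypothetical stand-ins; in particular, `top_fraction_mask` is not Iterative Mode Partition, whose details the abstract does not give.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def partially_dynamic_weight(x, static_w, expert_ws, gate_w, mask):
    """Build a layer weight that is input-dependent only where mask == 1.

    x:         (in_dim,) input vector
    static_w:  (out_dim, in_dim) static weight
    expert_ws: (K, out_dim, in_dim) candidate dynamic weights (assumed form)
    gate_w:    (K, in_dim) gating matrix (assumed form)
    mask:      (out_dim, in_dim) binary array, 1 = dynamic, 0 = static
    """
    gate = softmax(gate_w @ x)                         # (K,) mixing coefficients
    dynamic_w = np.tensordot(gate, expert_ws, axes=1)  # (out_dim, in_dim)
    return np.where(mask == 1, dynamic_w, static_w)


def top_fraction_mask(scores, fraction):
    """Mark the highest-scoring `fraction` of entries as dynamic.

    Hypothetical one-shot selection; the scores are assumed to be given.
    """
    k = int(round(fraction * scores.size))
    flat = np.zeros(scores.size, dtype=np.int64)
    if k > 0:
        top_idx = np.argsort(scores.ravel())[::-1][:k]
        flat[top_idx] = 1
    return flat.reshape(scores.shape)
```

With a dynamic fraction of 0.3, only 30% of the entries of the returned weight change with the input `x`; the remaining entries are always read from `static_w`, which is the sense in which the layer is partially rather than fully dynamic.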