Differentiable Sparsity via $D$-Gating: Simple and Versatile Structured Penalization

Structured sparsity regularization offers a principled way to compact neural networks, but its non-differentiability breaks compatibility with conventional stochastic gradient descent and requires either specialized optimizers or additional post-hoc pruning without formal guarantees. In this work, we propose $D$-Gating, a fully differentiable structured overparameterization that splits each group of weights into a primary weight vector and multiple scalar gating factors. We prove that any local minimum under $D$-Gating is also a local minimum using non-smooth structured $L_{2,2/D}$ penalization, and further show that the $D$-Gating objective converges at least exponentially fast to the $L_{2,2/D}$-regularized loss in the gradient flow limit. Together, our results show that $D$-Gating is theoretically equivalent to solving the original group sparsity problem, yet induces distinct learning dynamics that evolve from a non-sparse regime into sparse optimization. We validate our theory across vision, language, and tabular tasks, where $D$-Gating consistently delivers strong performance-sparsity tradeoffs and outperforms both direct optimization of structured penalties and conventional pruning baselines.
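The correspondence between the smooth overparameterized objective and the non-smooth $L_{2,2/D}$ penalty can be illustrated for a single weight group. The sketch below is a minimal illustration, not the authors' implementation. It assumes that a group weight is written as $w = v \cdot \prod_{d=1}^{D-1} g_d$ with one primary vector $v$ and $D-1$ scalar gates, and that the smooth penalty is a plain squared $L_2$ penalty on all factors scaled by $1/D$. The abstract fixes neither the number of gates nor the scaling, so both are assumptions, and all function names are hypothetical. Under these assumptions, the AM-GM inequality gives $\frac{1}{D}\big(\|v\|_2^2 + \sum_d g_d^2\big) \ge \|w\|_2^{2/D}$, with equality when all factors have equal magnitude.

```python
import numpy as np


def gated_weight(v, gates):
    """Effective group weight: primary vector scaled by the product of the scalar gates."""
    return v * np.prod(gates)


def smooth_penalty(v, gates):
    """Differentiable squared-L2 penalty on all factors, scaled by 1/D.

    D = 1 + number of gates; the 1/D scaling is an assumption of this sketch.
    """
    D = 1 + len(gates)
    return (np.dot(v, v) + np.sum(np.square(gates))) / D


def group_penalty(w, D):
    """Non-smooth group penalty ||w||_2^(2/D) for a single group."""
    return np.linalg.norm(w) ** (2.0 / D)


def balanced_factorization(w, D):
    """Factorize w so that ||v||_2 and every gate all equal ||w||_2^(1/D)."""
    norm = np.linalg.norm(w)
    s = norm ** (1.0 / D)
    gates = np.full(D - 1, s)
    v = w / s ** (D - 1) if norm > 0 else np.zeros_like(w)
    return v, gates


# Example group with ||w||_2 = 5 and D = 3 (one vector, two gates).
w = np.array([3.0, 4.0])
D = 3
v, gates = balanced_factorization(w, D)

# An unbalanced factorization of the same w: scale v up, one gate down.
v_unbalanced = 2.0 * v
gates_unbalanced = gates.copy()
gates_unbalanced[0] /= 2.0
```

With the balanced factors, `smooth_penalty(v, gates)` matches `group_penalty(w, D)`, while the unbalanced factors represent the same `w` at a strictly larger smooth penalty. This is the sense in which minimizing the smooth penalty over factorizations recovers the group penalty in this sketch.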