Layer-wise capacity in large language models is highly non-uniform: some layers contribute disproportionately to loss reduction while others are near-redundant. Existing methods for exploiting this non-uniformity, such as influence-function-based layer scoring, produce sensitivity estimates but offer no principled mechanism for translating them into allocation or pruning decisions under hardware constraints. We address this gap with a unified, curvature-aware framework grounded in the Minimum Description Length (MDL) principle. Our central quantity is the curvature-adjusted layer gain $\zeta_k^2 = g_k^\top \widetilde{H}_{kk}^{-1} g_k$, which we show equals twice the maximal second-order reduction in empirical risk achievable by updating layer $k$ alone, and which strictly dominates gradient-norm-based scores by incorporating local curvature. Normalizing these gains into layer quality scores $q_k$, we formulate two convex MDL programs: a capacity allocation program that distributes expert slots or LoRA rank preferentially to high-curvature layers under diminishing returns, and a pruning program that concentrates sparsity on low-gain layers while protecting high-gain layers from degradation. Both programs admit unique closed-form solutions parameterized by a single dual variable, computable in $O(K \log(1/\varepsilon))$ time via bisection. We prove an $O(\delta^2)$ transfer regret bound showing that source-domain allocations remain near-optimal on target tasks when curvature scores drift by $\delta$, with explicit constants tied to the condition number of the target program. Together, these results elevate layer-wise capacity optimization from an empirical heuristic to a theoretically grounded, computationally efficient framework with provable optimality and generalization guarantees.
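The abstract does not spell out the exact form of the allocation program, so the following is only a minimal illustrative sketch under an assumed diminishing-returns objective $\max_c \sum_k q_k \log(1 + c_k)$ subject to $\sum_k c_k = C$, $c_k \ge 0$. The function name `allocate_capacity` and this particular concave utility are illustrative assumptions, not the paper's actual program; the sketch shows only the structural claim that the solution is parameterized by one dual variable and recoverable by bisection in $O(K \log(1/\varepsilon))$.

```python
def allocate_capacity(q, budget, iters=100):
    """Illustrative water-filling sketch (assumed objective, not the paper's exact
    program): maximize sum_k q[k] * log(1 + c[k])  s.t.  sum_k c[k] = budget, c >= 0.

    KKT stationarity gives c_k = max(0, q_k / lam - 1) for a single dual
    variable lam; total spent capacity is decreasing in lam, so we bisect.
    Each of the O(log 1/eps) bisection steps costs O(K).
    """
    def spent(lam):
        # Closed-form primal solution for a fixed dual variable lam.
        return sum(max(0.0, qk / lam - 1.0) for qk in q)

    lo, hi = 1e-12, max(q)  # spent(lo) is huge; spent(max(q)) == 0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if spent(mid) > budget:
            lo = mid  # too much capacity used: increase the dual variable
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return [max(0.0, qk / lam - 1.0) for qk in q]
```

With quality scores `q = [4.0, 2.0, 1.0]` and `budget = 4.0`, the dual variable converges to roughly 1, allocating about `[3, 1, 0]`: the lowest-gain layer is cut off entirely, mirroring how the pruning program concentrates sparsity on low-gain layers.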