Sparse Knowledge Distillation: A Mathematical Framework for Probability-Domain Temperature Scaling and Multi-Stage Compression

From arXiv, machine learning theory. Develops an axiomatic, operator-agnostic framework for probability-domain knowledge distillation, including bias--variance analysis of sparse students, homotopy-based multi-stage pruning, $O(1/n)$ convergence guarantees, and equivalence classes of probability-domain softening operators. Theoretical analysis only.

We develop a unified theoretical framework for sparse knowledge distillation based on probability-domain softening operators. While the equivalence $p^{1/T} \propto \mathrm{softmax}(z/T)$ is well known, our contribution is an operator-level analytical framework built on this foundation rather than the equivalence itself. The framework comprises four core components: (i) operator-agnostic bias--variance decompositions that characterize when sparse students outperform dense teachers, (ii) a homotopy path formalization of multi-stage pruning in function space explaining why iterative compression succeeds where one-shot pruning fails, (iii) convergence guarantees establishing $O(1/n)$ rates for $n$-stage distillation with explicit parameter dependence, and (iv) equivalence class characterizations identifying distinct probability-domain operators that yield identical student models under capacity constraints. We introduce an axiomatic definition of probability-domain softening operators based on ranking preservation, continuity, entropy monotonicity, identity, and boundary behavior, and show that multiple non-equivalent operator families satisfy these axioms. All learning-theoretic guarantees are shown to hold uniformly across this operator class, independent of implementation details. These results provide theoretical grounding for black-box teacher distillation, partial-access settings such as top-$k$ truncation and text-only outputs, and privacy-preserving model compression.
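The equivalence $p^{1/T} \propto \mathrm{softmax}(z/T)$ that the abstract takes as its starting point can be checked numerically. The sketch below is illustrative only: the function names (`softmax`, `power_soften`, `entropy`) and the example logits are assumptions made here, not code or notation from the paper. It softens a teacher distribution using only its probabilities, compares the result with temperature scaling applied to the logits, and spot-checks three of the listed axioms (ranking preservation, entropy monotonicity, identity at $T = 1$) for this single operator.

```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable softmax of logits z at temperature T."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()  # shift for stability; does not change the result
    e = np.exp(z)
    return e / e.sum()

def power_soften(p, T):
    """Probability-domain softening: raise p to the power 1/T, then renormalize."""
    p = np.asarray(p, dtype=float)
    q = p ** (1.0 / T)
    return q / q.sum()

def entropy(p):
    """Shannon entropy in nats, skipping zero entries."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

# Hypothetical teacher logits and temperature, chosen only for illustration.
logits = np.array([2.0, 0.5, -1.0, 0.0])
T = 3.0

p = softmax(logits)                # what a black-box teacher would expose
from_probs = power_soften(p, T)    # softening without access to logits
from_logits = softmax(logits, T)   # classical logit-domain temperature scaling

print(np.allclose(from_probs, from_logits))                   # equivalence
print(np.array_equal(np.argsort(p), np.argsort(from_probs)))  # ranking preserved
print(entropy(from_probs) > entropy(p))                       # entropy increases for T > 1
print(np.allclose(power_soften(p, 1.0), p))                   # identity at T = 1
```

Because `power_soften` needs only the probability vector, the same computation applies when the teacher's logits are unavailable, which is the black-box setting the abstract mentions. This checks one operator family only; it does not illustrate the other operator families or the learning-theoretic results.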
