We develop a unified theoretical framework for sparse knowledge distillation based on probability-domain softening operators. While the equivalence $p^{1/T} \propto \mathrm{softmax}(z/T)$ is well known, our contribution is an operator-level analytical framework built on this foundation rather than the equivalence itself. The framework comprises four core components: (i) operator-agnostic bias--variance decompositions that characterize when sparse students outperform dense teachers, (ii) a homotopy path formalization of multi-stage pruning in function space explaining why iterative compression succeeds where one-shot pruning fails, (iii) convergence guarantees establishing $O(1/n)$ rates for $n$-stage distillation with explicit parameter dependence, and (iv) equivalence class characterizations identifying distinct probability-domain operators that yield identical student models under capacity constraints. We introduce an axiomatic definition of probability-domain softening operators based on ranking preservation, continuity, entropy monotonicity, identity, and boundary behavior, and show that multiple non-equivalent operator families satisfy these axioms. All learning-theoretic guarantees are shown to hold uniformly across this operator class, independent of implementation details. These results provide theoretical grounding for black-box teacher distillation, partial-access settings such as top-$k$ truncation and text-only outputs, and privacy-preserving model compression.
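The equivalence cited above can be checked numerically. The following is a minimal sketch, not taken from the paper: it renormalizes the teacher's probabilities raised to the power $1/T$ (probability-domain softening) and compares the result with the classical temperature-scaled softmax of the logits. The logits `z` and temperature `T` are illustrative choices, and `softmax` is a hypothetical helper defined here for the check.

```python
# Minimal numerical sketch (illustrative, not the paper's code) verifying
# p^{1/T} \propto softmax(z / T): softening in the probability domain
# coincides with temperature scaling of the logits after renormalization.
import numpy as np

def softmax(x):
    x = x - x.max()               # shift for numerical stability
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(0)
z = rng.normal(size=5)            # hypothetical teacher logits
T = 3.0                           # hypothetical temperature

p = softmax(z)                    # teacher probabilities (all a black-box student may see)
prob_domain = p ** (1.0 / T)      # probability-domain softening
prob_domain /= prob_domain.sum()  # renormalize to a distribution

logit_domain = softmax(z / T)     # classical logit-domain temperature scaling

assert np.allclose(prob_domain, logit_domain)
```

Because the probability-domain form never touches the logits, the same softened targets can be recovered from teacher outputs alone, which is what makes the black-box and partial-access settings in the abstract tractable.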