Knowledge distillation with multiple teachers is increasingly used to improve robustness, efficiency, and safety, yet existing approaches rely largely on heuristic or implementation-specific weighting schemes. This paper develops an operator-agnostic axiomatic framework for adaptive weighting in multi-teacher knowledge distillation across three complementary scales: token, task, and context. We formalize structural conditions under which adaptive weighting operators are well-defined, admit multiple non-equivalent implementations, and can be hierarchically composed via product-structure normalization. Within this framework, we establish existence and non-uniqueness of conforming operators, characterize convergence of gradient-based optimization under standard assumptions, analyze stability and perturbation robustness, and provide an abstract formulation of safety-constrained distillation. The results decouple theoretical guarantees from specific weighting formulas, enabling principled analysis of adaptive distillation methods under heterogeneity, distribution shift, and safety constraints.
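To make the composition concrete, the following is a minimal sketch of one possible product-structure normalization, under assumed notation: $\alpha^{\mathrm{tok}}_{t,k}$, $\alpha^{\mathrm{task}}_{k}$, and $\alpha^{\mathrm{ctx}}_{k}$ denote nonnegative per-teacher scores at the token, task, and context scales for teacher $k$ at position $t$. These symbols, and the loss below, are illustrative rather than the framework's formal definitions.

% Hypothetical notation: the alpha scores and the weighted KL objective are
% one conforming instantiation, not the paper's canonical operator.
\[
  w_{t,k}
  = \frac{\alpha^{\mathrm{tok}}_{t,k}\,\alpha^{\mathrm{task}}_{k}\,\alpha^{\mathrm{ctx}}_{k}}
         {\sum_{k'=1}^{K} \alpha^{\mathrm{tok}}_{t,k'}\,\alpha^{\mathrm{task}}_{k'}\,\alpha^{\mathrm{ctx}}_{k'}},
  \qquad
  \mathcal{L}(\theta)
  = \sum_{t}\sum_{k=1}^{K} w_{t,k}\,
    \mathrm{KL}\!\bigl(p_{k}(\cdot \mid x_{\le t}) \,\big\|\, q_{\theta}(\cdot \mid x_{\le t})\bigr).
\]

Here $K$ is the number of teachers, $p_{k}$ a teacher's predictive distribution, and $q_{\theta}$ the student's. Under this sketch, each scale rescales the others multiplicatively, and the shared normalization keeps the composed weights on the probability simplex; setting any one scale's scores to a constant recovers a two-scale operator, which illustrates one way the hierarchical composition described above can be realized.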