Deep ensembles deliver state-of-the-art, reliable uncertainty quantification, but their heavy computational and memory requirements hinder their practical deployment in real-world applications such as on-device AI. Knowledge distillation compresses an ensemble into small student models, but existing techniques struggle to preserve uncertainty, partly because reducing the size of a DNN typically reduces its variation. To resolve this limitation, we introduce a new distribution distillation method, called Gaussian distillation, which compresses a teacher ensemble into a student distribution rather than a student ensemble. Gaussian distillation treats each member of the teacher ensemble as a realization of a stochastic process and estimates the distribution of the ensemble with a special Gaussian process called the deep latent factor model (DLF). The mean and covariance functions of the DLF model are estimated stably with the expectation-maximization (EM) algorithm. On multiple benchmark datasets, we demonstrate that the proposed Gaussian distillation outperforms existing baselines. In addition, we show that Gaussian distillation works well for fine-tuning language models and under distribution shift.
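As a concrete illustration of the kind of model the abstract describes, the following is a minimal sketch of a generic latent factor representation of ensemble members; the notation (factor loadings $\psi_k$, latent scores $z_{mk}$, number of factors $K$, noise variance $\sigma^2$) is our own and is not taken from the paper, and the paper's DLF specification may differ.

% Illustrative sketch only: a generic latent factor model for the outputs
% f_m of the m-th teacher member, not the paper's exact DLF model.
\begin{align}
  f_m(x) &= \mu(x) + \sum_{k=1}^{K} z_{mk}\,\psi_k(x) + \varepsilon_m(x),
  \qquad z_{mk} \sim \mathcal{N}(0,1), \quad
  \varepsilon_m(x) \sim \mathcal{N}(0,\sigma^2), \\
  \operatorname{Cov}\!\bigl(f_m(x), f_m(x')\bigr)
  &= \sum_{k=1}^{K} \psi_k(x)\,\psi_k(x') + \sigma^2\,\mathbf{1}\{x = x'\}.
\end{align}

Under this sketch, each teacher member is a draw from a Gaussian process with mean $\mu$ and a low-rank-plus-noise covariance; an EM fit would alternate between inferring the posterior of the latent scores $z_{mk}$ given the observed member outputs (E-step) and updating $\mu$, $\{\psi_k\}$, and $\sigma^2$ (M-step).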