Efficiently and reliably estimating uncertainty is an important objective in deep learning. It is especially pertinent to autoregressive sequence tasks, where training and inference costs are typically very high. However, existing research has predominantly focused on tasks with static data such as image classification. In this work, we investigate Ensemble Distribution Distillation (EDD) applied to large-scale natural language sequence-to-sequence data. EDD aims to compress the superior uncertainty performance of an expensive (teacher) ensemble into a cheaper (student) single model. Importantly, the student retains the ability to separate knowledge (epistemic) and data (aleatoric) uncertainty. Existing probability-space approaches to EDD, however, are difficult to scale to large vocabularies. We show, for modern transformer architectures on large-scale translation tasks, that modelling the ensemble logits, instead of softmax probabilities, leads to significantly better students. More surprisingly, the students outperform Deep Ensembles by up to about 10% AUROC on out-of-distribution detection, whilst matching them at in-distribution translation.
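To make the separation of knowledge and data uncertainty concrete, the following is a minimal sketch of the standard entropy-based decomposition for an ensemble at a single token position: total uncertainty is the entropy of the averaged prediction, data uncertainty is the average entropy of the members, and knowledge uncertainty is their difference (the mutual information). It illustrates the teacher-side quantities only, not the distillation procedure itself. The function names, the three-word vocabulary, and the logit values are hypothetical, and the sequence-level measures used in practice may differ.

```python
import math


def softmax(logits):
    """Convert one logit vector into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def decompose_uncertainty(member_logits):
    """Split ensemble uncertainty for one token position.

    member_logits: list of logit vectors, one per ensemble member
    (hypothetical input layout, assumed for this sketch).
    Returns (total, data, knowledge) in nats.
    """
    member_probs = [softmax(logits) for logits in member_logits]
    n = len(member_probs)
    vocab = len(member_probs[0])

    # Ensemble prediction: average of the member distributions.
    mean_probs = [sum(p[k] for p in member_probs) / n for k in range(vocab)]

    total = entropy(mean_probs)                       # total uncertainty
    data = sum(entropy(p) for p in member_probs) / n  # expected member entropy
    knowledge = total - data                          # mutual information
    return total, data, knowledge


# Toy inputs: members that agree versus members that each prefer a different token.
agree = [[4.0, 0.0, 0.0], [4.0, 0.0, 0.0], [4.0, 0.0, 0.0]]
disagree = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]

total_a, data_a, know_a = decompose_uncertainty(agree)
total_d, data_d, know_d = decompose_uncertainty(disagree)
```

In this toy setting every member is equally confident, so data uncertainty is the same in both cases; only the disagreement between members raises knowledge uncertainty, which is the signal typically used for out-of-distribution detection.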