Knowledge distillation is one of the primary methods for transferring knowledge from large models to small ones. However, it requires massive amounts of task-specific data, which may not be available in many real-world applications. Data augmentation methods such as representation interpolation, token replacement, and model-based augmentation have been applied to tackle this problem. However, these methods either risk shifting decision boundaries (representation interpolation), are not expressive enough (token replacement), or introduce too much computational overhead (model-based augmentation). To this end, we propose AugPro (Augmentation with Projection), an effective and efficient data augmentation method for distillation. Our method builds on representation interpolation to maintain the diversity of expressions, and converts the augmented data back to tokens to avoid shifting decision boundaries. It uses simple operations that add little computational overhead. Results on multiple GLUE tasks show that our method improves distillation performance by a large margin at a low time cost. Code is available at https://github.com/google-research/google-research/tree/master/augpro.
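The interpolate-then-project idea described above can be illustrated with a minimal sketch. It is not the released AugPro code: it assumes that projection means a nearest-neighbor lookup in the token embedding table under Euclidean distance, and the function name `mix_and_project`, the mixing weight `lam`, and the toy embedding table are hypothetical names introduced only for this example.

```python
import numpy as np


def mix_and_project(ids_a, ids_b, embedding, lam=0.5):
    """Interpolate two token sequences in embedding space, then map back to tokens.

    Assumption for illustration: projection is a nearest-neighbor search over
    the embedding table using squared Euclidean distance.
    """
    ids_a = np.asarray(ids_a)
    ids_b = np.asarray(ids_b)
    if ids_a.shape != ids_b.shape:
        raise ValueError("both sequences must have the same length")

    # Representation interpolation: convex combination of the two embeddings.
    mixed = lam * embedding[ids_a] + (1.0 - lam) * embedding[ids_b]

    # Projection: for each position, find the closest vocabulary embedding.
    # dists has shape (seq_len, vocab_size).
    diffs = mixed[:, None, :] - embedding[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    return dists.argmin(axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toy_embedding = rng.normal(size=(20, 8))  # hypothetical vocab of 20 tokens
    sent_a = [1, 4, 7, 9]
    sent_b = [2, 5, 8, 3]
    augmented = mix_and_project(sent_a, sent_b, toy_embedding, lam=0.5)
    print(augmented)  # token ids that a teacher model could then label
```

Because the output is a sequence of discrete token ids rather than a mixed embedding, it can be fed to the teacher like any ordinary input, which is the property the abstract credits with avoiding shifts in decision boundaries. The brute-force distance computation here is only suitable for a toy vocabulary.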