Knowledge distillation encourages a student model to behave more like a teacher model, so that the student largely retains the teacher's performance even though it may have substantially fewer parameters. However, while distillation helps student models behave more like teacher models in-distribution, this is not necessarily the case out-of-distribution. To address this, we use a language model to create task-specific unlabeled data that mimics the data in targeted out-of-distribution domains. We use this generated data for knowledge distillation on the task of Natural Language Inference (NLI), encouraging the student models to behave more like the teacher models on these examples. Our domain-targeted augmentation is highly effective and outperforms previous robustness methods when evaluating out-of-distribution performance on MNLI. Surprisingly, this method also improves performance on out-of-distribution domains that the data was not generated for. We additionally introduce Distilled Minority Upsampling (DMU), a method for identifying and upsampling minority examples during distillation. DMU is complementary to the domain-targeted augmentation and substantially improves performance on SNLI-hard. Finally, we show out-of-distribution improvements on HANS from both of our methods, despite augmenting the training data with fewer than 5k examples.
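Distillation on unlabeled examples needs only the teacher's and the student's output distributions, which is why generated, unlabeled data can be used for it. The following is a minimal sketch of such a loss, not the authors' implementation: it assumes a KL-divergence objective between teacher and student class distributions, and the `temperature` parameter and all function names are illustrative assumptions that do not come from the abstract.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits.

    `temperature` is an assumed hyperparameter; values above 1 soften
    the distribution.
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_sum_exp for x in scaled]


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for a single example.

    No gold label is used: the teacher's distribution is the only target.
    """
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def batch_distillation_loss(teacher_batch, student_batch, temperature=1.0):
    """Mean per-example distillation loss over a batch of logit lists."""
    losses = [
        distillation_loss(t, s, temperature)
        for t, s in zip(teacher_batch, student_batch)
    ]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Hypothetical logits for two generated NLI examples, ordered as
    # (entailment, neutral, contradiction).
    teacher = [[2.5, 0.3, -1.0], [-0.5, 0.2, 1.8]]
    student = [[1.0, 0.8, -0.2], [0.1, 0.0, 0.9]]
    print(batch_distillation_loss(teacher, student, temperature=2.0))
```

In a training loop, this quantity would be minimized with respect to the student's parameters while the teacher stays fixed; the sketch covers only the loss computation.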