Recent work on compressing pre-trained language models (PLMs) such as BERT has focused mainly on improving in-distribution performance on downstream tasks. However, very few of these studies have analyzed the impact of compression on the generalizability and robustness of the compressed models on out-of-distribution (OOD) data. To this end, we study two popular model compression techniques, knowledge distillation and pruning, and show that the compressed models are significantly less robust than their PLM counterparts on OOD test sets, even though they obtain similar performance on the in-distribution development sets for a task. Further analysis indicates that the compressed models overfit on the shortcut samples and generalize poorly on the hard ones. We leverage this observation to develop a regularization strategy for robust model compression based on sample uncertainty. Experimental results on several natural language understanding tasks demonstrate that our bias mitigation framework improves the OOD generalization of the compressed models without sacrificing in-distribution task performance.