Reducing the parameter count of large-scale pre-trained language models (PLMs) through knowledge distillation has greatly facilitated their deployment on a wide range of devices. However, deploying knowledge distillation systems in real-world, industrial-strength applications remains challenging: such applications require complex distillation methods on even larger PLMs (over 10B parameters), and are constrained by GPU memory and by the difficulty of switching between methods. To overcome these challenges, we propose GKD, a general knowledge distillation framework that supports distillation on larger-scale PLMs using various distillation methods. With GKD, developers can build larger distillation models on memory-limited GPUs and easily switch between and combine different distillation methods within a single framework. Experimental results show that GKD can support the distillation of at least 100B-scale PLMs and 25 mainstream methods on 8 NVIDIA A100 (40GB) GPUs.