Large transformer-based models such as BERT, GPT, and T5 have driven significant advances in natural language processing. However, these models are computationally expensive, which motivates model compression techniques that reduce their size and complexity while maintaining accuracy. This project investigates and applies knowledge distillation for BERT model compression, focusing on the TinyBERT student model. We explore several techniques to improve knowledge distillation, including experimenting with loss functions, transformer layer mapping methods, and the weights of the attention and representation losses, and we evaluate the proposed techniques on a selection of downstream tasks from the GLUE benchmark. The goal of this work is to improve the efficiency and effectiveness of knowledge distillation, enabling more efficient and accurate models for a range of natural language processing tasks.
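To make the three tuning knobs named above concrete (loss function, layer mapping, and the weights of the attention and representation losses), the following is a minimal sketch and not the project's actual implementation. It assumes a uniform layer mapping in the style of TinyBERT, mean squared error for both loss terms, and teacher and student hidden states of equal width, so no projection matrix is needed. The names `uniform_layer_map`, `layer_distillation_loss`, `alpha`, and `beta` are hypothetical and chosen only for illustration.

```python
from typing import List

Matrix = List[List[float]]


def uniform_layer_map(num_student_layers: int, num_teacher_layers: int) -> List[int]:
    """Map student layer m (1-based) to teacher layer m * (N / M).

    Assumption: the teacher depth N is a multiple of the student depth M.
    """
    if num_student_layers <= 0 or num_teacher_layers % num_student_layers != 0:
        raise ValueError("teacher depth must be a positive multiple of student depth")
    step = num_teacher_layers // num_student_layers
    return [m * step for m in range(1, num_student_layers + 1)]


def mse(a: Matrix, b: Matrix) -> float:
    """Mean squared error between two matrices of identical shape."""
    flat_a = [x for row in a for x in row]
    flat_b = [y for row in b for y in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("shape mismatch")
    return sum((x - y) ** 2 for x, y in zip(flat_a, flat_b)) / len(flat_a)


def layer_distillation_loss(
    student_attn: List[Matrix],
    teacher_attn: List[Matrix],
    student_hidden: List[Matrix],
    teacher_hidden: List[Matrix],
    alpha: float = 1.0,  # hypothetical weight on the attention loss
    beta: float = 1.0,   # hypothetical weight on the representation loss
) -> float:
    """Weighted sum of attention and hidden-state MSE over mapped layer pairs."""
    mapping = uniform_layer_map(len(student_attn), len(teacher_attn))
    attn_loss = 0.0
    hidden_loss = 0.0
    for student_idx, teacher_layer in enumerate(mapping):
        teacher_idx = teacher_layer - 1  # mapping is 1-based, lists are 0-based
        attn_loss += mse(student_attn[student_idx], teacher_attn[teacher_idx])
        hidden_loss += mse(student_hidden[student_idx], teacher_hidden[teacher_idx])
    return alpha * attn_loss + beta * hidden_loss
```

Under these assumptions, swapping `mse` for another distance changes the loss function, replacing `uniform_layer_map` changes which teacher layers supervise each student layer, and `alpha` and `beta` control the balance between the attention and representation terms.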