Pre-trained language models (PLMs), such as BERT or RoBERTa, define the state of the art for natural language understanding tasks when fine-tuned on labeled data. However, their large size makes them hard to deploy for inference in real-world applications, owing to significant GPU memory requirements and high inference latency. This paper explores neural architecture search (NAS) for structural pruning, with the goal of finding sub-networks of the fine-tuned network that optimally trade off efficiency (for example, model size or latency) against generalization performance. We also show how recently developed two-stage weight-sharing NAS approaches can be used in this setting to accelerate the search. Unlike traditional pruning methods that rely on fixed thresholds, we propose a multi-objective approach that identifies the Pareto-optimal set of sub-networks, allowing for a more flexible and automated compression process.
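To make the notion of a Pareto-optimal set of sub-networks concrete, the following is a minimal sketch, assuming each candidate sub-network has already been evaluated on two objectives that are both minimized (model size and validation error). The function name `pareto_front` and the candidate values are hypothetical and chosen for illustration only; they are not results or code from the paper.

```python
from typing import List, Tuple


def pareto_front(points: List[Tuple[float, float]]) -> List[int]:
    """Return indices of non-dominated points, minimizing both objectives."""
    front = []
    for i, (size_i, err_i) in enumerate(points):
        # Point j dominates point i if it is no worse in both objectives
        # and strictly better in at least one of them.
        dominated = any(
            (size_j <= size_i and err_j <= err_i)
            and (size_j < size_i or err_j < err_i)
            for j, (size_j, err_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Hypothetical candidates: (parameters in millions, validation error).
candidates = [
    (110.0, 0.08),
    (85.0, 0.09),
    (85.0, 0.12),
    (60.0, 0.11),
    (40.0, 0.15),
    (70.0, 0.16),
]

for idx in pareto_front(candidates):
    size, err = candidates[idx]
    print(f"sub-network {idx}: {size:.0f}M parameters, error {err:.2f}")
```

A sub-network is kept only if no other candidate is at least as good on both objectives and strictly better on one. A practitioner can then pick any point from the resulting set according to the deployment budget, rather than fixing a pruning threshold in advance. The quadratic pairwise comparison is adequate for small candidate sets; larger searches would likely use a sort-based method.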