Pre-trained language models (PLMs), for example BERT or RoBERTa, mark the state of the art for natural language understanding tasks when fine-tuned on labeled data. However, their large size poses challenges for deploying them for inference in real-world applications, due to significant GPU memory requirements and high inference latency. This paper explores neural architecture search (NAS) for structural pruning to find sub-parts of the fine-tuned network that optimally trade off efficiency, for example in terms of model size or latency, against generalization performance. We also show how more recently developed two-stage weight-sharing NAS approaches can be utilized in this setting to accelerate the search process. Unlike traditional pruning methods with fixed thresholds, we propose a multi-objective approach that identifies the Pareto-optimal set of sub-networks, allowing for a more flexible and automated compression process.
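The multi-objective view can be made concrete with a small sketch: given candidate sub-networks scored on two objectives to be minimized (here, parameter count and validation error; the function name, candidates, and numbers are illustrative assumptions, not from the paper), the Pareto-optimal set contains exactly those candidates not dominated by any other.

```python
def pareto_front(candidates):
    """Return the non-dominated candidates among (size, error) pairs.

    Candidate a dominates b if a is no worse than b in both objectives
    and strictly better in at least one (both objectives minimized).
    """
    front = []
    for i, (size_i, err_i) in enumerate(candidates):
        dominated = any(
            size_j <= size_i and err_j <= err_i
            and (size_j < size_i or err_j < err_i)
            for j, (size_j, err_j) in enumerate(candidates)
            if j != i
        )
        if not dominated:
            front.append((size_i, err_i))
    return front

# Hypothetical sub-networks: (parameter count, validation error).
subnets = [(110e6, 0.09), (66e6, 0.10), (66e6, 0.12), (30e6, 0.14), (30e6, 0.20)]
print(sorted(pareto_front(subnets)))
```

Here (66e6, 0.12) is dominated by (66e6, 0.10), and (30e6, 0.20) by (30e6, 0.14), so only the three non-dominated trade-offs survive; a practitioner can then pick a point on the front that matches their deployment budget rather than a fixed pruning threshold.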