Large language models (LLMs) exhibit remarkable reasoning abilities, allowing them to generalize across a wide range of downstream tasks, such as commonsense reasoning or instruction following. However, as LLMs scale, inference costs become increasingly prohibitive, accumulating significantly over a model's life cycle. This raises the question: Can we compress pre-trained LLMs to meet diverse size and latency requirements? We leverage Neural Architecture Search (NAS) to compress LLMs by pruning structural components, such as attention heads, neurons, and layers, aiming to achieve a Pareto-optimal balance between performance and efficiency. While prior work has shown promising NAS results on small language models, in this paper we propose several extensions that allow NAS to scale to LLMs. Compared to structural pruning baselines, we show that NAS improves performance by up to 3.4% on MMLU while also delivering an on-device latency speedup.
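To make the search concrete, the following is a minimal sketch of the kind of structural search space the abstract describes: candidate architectures vary the number of retained attention heads, FFN neurons, and layers, and a Pareto front trades off a performance proxy against an inference-cost proxy. The dimension values, the `quality` and `cost` proxies, and all function names are illustrative assumptions, not the paper's actual method; a real NAS run would evaluate pruned models on held-out data.

```python
import itertools

# Hypothetical structural search space after pruning a pre-trained LLM:
# how many attention heads, FFN neurons, and layers to keep.
# (Values are illustrative, not taken from the paper.)
HEADS = [8, 16, 32]
NEURONS = [2048, 4096, 8192]
LAYERS = [16, 24, 32]


def cost(cfg):
    """Toy proxy for inference cost (rough parameter-count scale)."""
    heads, neurons, layers = cfg
    return layers * (heads * 128 + neurons)


def quality(cfg):
    """Toy proxy for task performance; a real NAS run would instead
    evaluate the pruned model on a benchmark such as MMLU."""
    heads, neurons, layers = cfg
    return (heads / 32 + neurons / 8192 + layers / 32) / 3


def pareto_front(candidates):
    """Keep configurations not dominated by any other candidate
    (dominated = another config is at least as good on both axes
    and strictly better on one)."""
    front = []
    for c in candidates:
        dominated = any(
            quality(o) >= quality(c)
            and cost(o) <= cost(c)
            and (quality(o) > quality(c) or cost(o) < cost(c))
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return front


candidates = list(itertools.product(HEADS, NEURONS, LAYERS))
front = pareto_front(candidates)
```

Here exhaustive enumeration is feasible because the toy space has only 27 configurations; at LLM scale the search space is combinatorial, which is why NAS strategies (e.g. weight-sharing or evolutionary search) are used instead of brute force.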