Given the generational gap in available hardware between lay practitioners and the best-endowed institutions, LLMs are becoming increasingly inaccessible as they grow in size. Whilst many approaches have been proposed to compress LLMs to make their resource consumption manageable, these methods themselves tend to be resource-intensive, putting them out of reach of the very user groups they target. In this work, we explore the problem of structured pruning of LLMs using only forward passes. We seek to empower practitioners to prune models so large that their available hardware has just enough memory to run inference. We develop Bonsai, a gradient-free, perturbative pruning method capable of delivering small, fast, and accurate pruned models. We observe that Bonsai outputs pruned models that (i) outperform those generated by more expensive gradient-based structured pruning methods, and (ii) are twice as fast (with comparable accuracy) as those generated by semi-structured pruning methods that require resources comparable to Bonsai's. We also leverage Bonsai to produce a new sub-2B-parameter model using a single A6000 that yields state-of-the-art performance on 4 of 6 tasks on the Huggingface Open LLM leaderboard.
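The abstract states only that the method is gradient-free and perturbative; it does not spell out the algorithm. As a rough illustration of how forward passes alone can rank structures for pruning, the sketch below drops random subsets of hidden units in a toy NumPy network, records the loss of each perturbed sub-model, and fits a least-squares regression of loss on the keep/drop masks. The toy network, the regression-based scoring, and the names `forward_loss`, `estimate_importance`, and `prune_mask` are all assumptions made for illustration, not the Bonsai implementation.

```python
import numpy as np


def forward_loss(W1, w2, X, y, mask):
    """Mean squared error of a one-hidden-layer ReLU net with masked hidden units.

    Only a forward pass is used; no gradients are computed anywhere.
    """
    hidden = np.maximum(X @ W1, 0.0) * mask  # mask[i] == 0 removes hidden unit i
    return float(np.mean((hidden @ w2 - y) ** 2))


def estimate_importance(W1, w2, X, y, n_samples=1000, keep_prob=0.5, seed=0):
    """Score hidden units by regressing forward-pass loss on random keep/drop masks.

    Hypothetical scoring rule chosen for this sketch, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    n_units = W1.shape[1]
    # Each row is one perturbed sub-model: 1.0 keeps a unit, 0.0 drops it.
    masks = (rng.random((n_samples, n_units)) < keep_prob).astype(float)
    losses = np.array([forward_loss(W1, w2, X, y, m) for m in masks])
    # Least-squares fit of: loss ~ masks @ coef + intercept.
    design = np.hstack([masks, np.ones((n_samples, 1))])
    coef, *_ = np.linalg.lstsq(design, losses, rcond=None)
    # A negative coefficient means keeping the unit lowers the loss,
    # so negate it to get "higher is more important".
    return -coef[:-1]


def prune_mask(importance, n_keep):
    """Binary mask that keeps the n_keep highest-scoring units (n_keep >= 1)."""
    mask = np.zeros_like(importance)
    mask[np.argsort(importance)[-n_keep:]] = 1.0
    return mask


# Toy setup: 8 hidden units, of which only 4 have a nonzero output weight.
rng = np.random.default_rng(1)
X = rng.normal(size=(256, 5))
W1 = rng.normal(size=(5, 8))
W1 /= np.linalg.norm(W1, axis=0)  # unit-norm columns keep the units on one scale
w2 = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0])
y = np.maximum(X @ W1, 0.0) @ w2  # targets are the unpruned model's outputs

scores = estimate_importance(W1, w2, X, y)
mask = prune_mask(scores, n_keep=4)
print("kept units:", np.flatnonzero(mask))
print("loss after pruning:", forward_loss(W1, w2, X, y, mask))
```

Because every score comes from evaluating the model rather than differentiating it, the memory footprint of this procedure is that of inference, which is the constraint the abstract targets. The cost is paid in the number of forward passes (`n_samples` here), so the sample count trades pruning quality against compute.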