The abilities of modern large language models (LLMs) in solving natural language processing, complex reasoning, sentiment analysis and other tasks have been extraordinary, which has prompted their extensive adoption. Unfortunately, these abilities come with very high memory and computational costs, which preclude the use of LLMs on most hardware platforms. To mitigate this, we propose an effective method of finding Pareto-optimal network architectures based on LLaMA2-7B using one-shot NAS. In particular, we fine-tune LLaMA2-7B only once and then apply genetic-algorithm-based search to find smaller, less computationally complex network architectures. We show that, for certain standard benchmark tasks, the pre-trained LLaMA2-7B network is unnecessarily large and complex. More specifically, we demonstrate a 1.5x reduction in model size and a 1.3x speedup in throughput for certain tasks with a negligible drop in accuracy. In addition to finding smaller, higher-performing network architectures, our method does so more effectively and efficiently than certain pruning or sparsification techniques. Finally, we demonstrate that quantization is complementary to our method and that the size and complexity of the networks we find can be further decreased using quantization. We believe that our work provides a way to automatically create LLMs which can be used on less expensive and more readily available hardware platforms.
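To make the search procedure concrete, the following is a minimal sketch of a genetic-algorithm-based Pareto search over sub-architectures of a weight-shared super-network. The search space (kept depth and per-layer FFN width ratios), population sizes, and the toy `evaluate` proxy are illustrative assumptions for exposition, not the exact space, objectives, or hyperparameters used in this work; in practice the objectives would be measured accuracy and size/latency of each candidate activated within the fine-tuned super-network.

```python
import random

# Assumed (illustrative) search space over LLaMA2-7B sub-architectures:
# number of transformer layers kept and a per-layer FFN width ratio.
LAYER_CHOICES = [24, 28, 32]          # LLaMA2-7B has 32 layers
WIDTH_CHOICES = [0.5, 0.75, 1.0]

def random_arch():
    """Sample a candidate sub-architecture (depth + per-layer FFN widths)."""
    depth = random.choice(LAYER_CHOICES)
    return {"depth": depth,
            "widths": [random.choice(WIDTH_CHOICES) for _ in range(depth)]}

def evaluate(arch):
    """Toy stand-in for the two objectives (accuracy, size).
    A real implementation would evaluate the sub-network on benchmark tasks."""
    size = arch["depth"] * sum(arch["widths"])       # relative size proxy
    acc = 1.0 - 0.5 / (1.0 + 0.01 * size)            # toy proxy: larger ~ slightly more accurate
    return acc, size

def dominates(a, b):
    """a Pareto-dominates b: accuracy no worse, size no larger, one strictly better."""
    return (a[0] >= b[0] and a[1] <= b[1]) and (a[0] > b[0] or a[1] < b[1])

def mutate(arch, rate=0.1):
    """Randomly perturb depth and individual width ratios."""
    child = {"depth": arch["depth"], "widths": list(arch["widths"])}
    if random.random() < rate:
        child["depth"] = random.choice(LAYER_CHOICES)
        child["widths"] = child["widths"][:child["depth"]] + [
            random.choice(WIDTH_CHOICES)
            for _ in range(child["depth"] - len(child["widths"]))]
    for i in range(child["depth"]):
        if random.random() < rate:
            child["widths"][i] = random.choice(WIDTH_CHOICES)
    return child

def pareto_front(scored):
    return [a for a, fa in scored
            if not any(dominates(fb, fa) for _, fb in scored)]

def search(pop_size=32, generations=20):
    population = [random_arch() for _ in range(pop_size)]
    for _ in range(generations):
        scored = [(arch, evaluate(arch)) for arch in population]
        front = pareto_front(scored)                 # non-dominated parents
        population = [mutate(random.choice(front)) for _ in range(pop_size)]
    return pareto_front([(arch, evaluate(arch)) for arch in population])

if __name__ == "__main__":
    for arch in search():
        print(arch["depth"], evaluate(arch))
```

The key design point this sketch illustrates is that, because the super-network is fine-tuned only once, each candidate evaluation requires no additional training, so the genetic search over the accuracy-versus-size Pareto front remains inexpensive.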