Large language models (LLMs) have revolutionized natural language processing (NLP) by excelling at understanding and generating human-like text. However, their widespread deployment can be prohibitively expensive. SortedNet is a recent training technique that enables dynamic inference by exploiting the modularity of networks and sorting sub-models in a nested manner according to their computation/accuracy trade-off. We extend SortedNet to generative NLP tasks, making large language models dynamic without any pre-training, simply by replacing Standard Fine-Tuning (SFT) with Sorted Fine-Tuning (SoFT). Our approach improves model efficiency and eliminates the need to maintain multiple models for different inference scenarios. We show that it can unlock the power of the intermediate layers of transformers in generating the target output. Because the sub-models remain integral components of the original model, storage requirements and the cost of switching between computational/latency budgets are minimized. We demonstrate the efficacy of the method by tuning LLaMA 2 13B on the Stanford Alpaca dataset for instruction following and on TriviaQA for closed-book question answering. Our results show that the sub-models outperform their Standard Fine-Tuning and SFT+ICT (Early-Exit) counterparts, with efficient tuning and no additional memory usage during inference.
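To make the idea of nested sub-models concrete, the following is a minimal, illustrative sketch and not the authors' implementation. It assumes, beyond what the abstract states, that each sub-model consists of the first d layers of the network followed by a shared output head, and that the training objective averages the per-sub-model losses. All names (`make_matrix`, `apply_layer`, `sorted_loss`, `exit_depths`) are hypothetical, and small random matrices stand in for transformer blocks.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Random weight matrix; a toy stand-in for a transformer block or output head."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def apply_layer(weights, hidden):
    """Toy residual layer: hidden + tanh(W @ hidden)."""
    return [h + math.tanh(z) for h, z in zip(hidden, matvec(weights, hidden))]


def cross_entropy(logits, target):
    """Softmax cross-entropy for a single target index."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_z - logits[target]


def sorted_loss(layers, head, hidden, target, exit_depths):
    """Average the loss over nested sub-models (assumed: first d layers + shared head)."""
    per_depth = {}
    for depth, weights in enumerate(layers, start=1):
        hidden = apply_layer(weights, hidden)
        if depth in exit_depths:
            # Every sub-model reuses the same head, so no extra parameters are stored.
            per_depth[depth] = cross_entropy(matvec(head, hidden), target)
    if not per_depth:
        raise ValueError("exit_depths selects no layer")
    return sum(per_depth.values()) / len(per_depth), per_depth


rng = random.Random(0)
DIM, VOCAB, NUM_LAYERS = 4, 5, 6
layers = [make_matrix(DIM, DIM, rng) for _ in range(NUM_LAYERS)]
head = make_matrix(VOCAB, DIM, rng)
hidden0 = [rng.gauss(0.0, 1.0) for _ in range(DIM)]

avg_loss, per_depth = sorted_loss(layers, head, hidden0, target=2, exit_depths={2, 4, 6})
print(per_depth)
```

Because every sub-model in this sketch is a prefix of the full layer stack, choosing a smaller budget at inference time amounts to stopping the forward pass earlier; no separate set of weights needs to be loaded.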