Large Language Models (LLMs) have advanced rapidly but impose significant memory demands. Quantization is a promising remedy, yet current methods typically require lengthy training to offset the performance degradation caused by quantization loss. Deploying LLMs across scenarios with different resource constraints, e.g., servers and personal computers, then requires repeating that training for each application, which compounds the cost. It is therefore advantageous to train a once-for-all (OFA) supernet that yields diverse optimal subnets for downstream applications through one-shot training. However, the scale of current language models impedes training efficiency and amplifies the interference caused by weight sharing between subnets. We make an initial attempt to extend the once-for-all framework to large language models. Specifically, we decouple the shared weights to eliminate the interference and incorporate low-rank adapters for training efficiency. Furthermore, we observe that traditional uniform sampling allocates training resources unevenly. We introduce a non-parametric scheduler that adjusts the sampling rate for each quantization configuration, achieving a more balanced allocation among subnets with varying demands. We validate the approach on the LLaMA2 family, and downstream evaluation confirms that it maintains high performance while significantly reducing deployment time when multiple scenarios must be served.
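The abstract does not spell out how the scheduler works, so the following is only an illustrative sketch of the sampling imbalance it refers to, not the authors' method. It assumes (hypothetically) that a subnet is defined by one bit-width per layer, that the options are the contiguous integers in `BIT_CHOICES`, and that the model has `NUM_LAYERS` layers. Under these assumptions, drawing each layer uniformly concentrates the average bit-width near the middle, so low-bit subnets are rarely trained. `sample_balanced` is a made-up stand-in that first picks a total bit budget uniformly and then spreads it over the layers.

```python
import random

# Hypothetical search space: these values are assumptions for illustration.
BIT_CHOICES = (2, 3, 4)   # contiguous per-layer bit-width options
NUM_LAYERS = 32           # number of quantized layers


def sample_uniform(rng):
    """Draw each layer's bit-width independently and uniformly."""
    return [rng.choice(BIT_CHOICES) for _ in range(NUM_LAYERS)]


def sample_balanced(rng):
    """Draw a total bit budget uniformly, then spread it randomly over layers.

    This is a hypothetical balancing rule, not the scheduler from the paper.
    It relies on BIT_CHOICES being contiguous integers.
    """
    lo, hi = min(BIT_CHOICES), max(BIT_CHOICES)
    total = rng.randint(lo * NUM_LAYERS, hi * NUM_LAYERS)
    # Each layer owns (hi - lo) one-bit increments above the minimum width.
    slots = [layer for layer in range(NUM_LAYERS) for _ in range(hi - lo)]
    bits = [lo] * NUM_LAYERS
    for layer in rng.sample(slots, total - lo * NUM_LAYERS):
        bits[layer] += 1
    return bits


def low_bit_fraction(sampler, n=2000, seed=0, threshold=2.5):
    """Fraction of sampled subnets whose average bit-width is <= threshold."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(n)
        if sum(sampler(rng)) / NUM_LAYERS <= threshold
    )
    return hits / n


if __name__ == "__main__":
    print("uniform :", low_bit_fraction(sample_uniform))
    print("balanced:", low_bit_fraction(sample_balanced))
```

Running the script compares how often each sampler visits subnets with an average of at most 2.5 bits. In this toy setting, the uniform sampler almost never does, while the budget-first sampler visits them roughly a quarter of the time.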