Large language models (LLMs) are increasingly adapted into domain-specific variants for applications in law, healthcare, and finance. Their scale, however, limits deployment in resource-constrained settings, and existing compression approaches often either degrade after domain adaptation or require substantial additional computation. We introduce EfficientXpert, a lightweight framework for domain pruning that integrates ForeSight Mask, a propagation-aware criterion for selecting weights to prune without backpropagation, and Partial Brain Surgeon, an efficient closed-form update for low-rank adapters under a fixed sparsity pattern. With fine-tuning cost comparable to standard LoRA, EfficientXpert converts a general pretrained model into a sparse, domain-adapted expert in a single pruning step. Across health and legal benchmarks, EfficientXpert reaches up to 98 percent of dense performance at 40 percent sparsity, improving over prior pruning baselines while matching LoRA training time and staying within 1 percent of LoRA peak GPU memory in our experiments.
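The pipeline the abstract describes, merging a LoRA adapter into the base weights and then applying a single pruning step to obtain a sparse domain expert, can be sketched as follows. This is a minimal illustration only: it uses plain magnitude scoring as a stand-in for ForeSight Mask (which is propagation-aware) and omits Partial Brain Surgeon's closed-form adapter update; all shapes and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a pretrained weight W and LoRA factors B, A of rank r.
d_out, d_in, r = 64, 64, 8
W = rng.normal(size=(d_out, d_in))
B = rng.normal(scale=0.01, size=(d_out, r))
A = rng.normal(scale=0.01, size=(r, d_in))

# Merge the low-rank adapter into the dense weight, as in standard LoRA.
W_merged = W + B @ A

# Stand-in criterion: magnitude scoring (ForeSight Mask's actual
# propagation-aware score is not reproduced here).
sparsity = 0.40
k = int(sparsity * W_merged.size)  # number of weights to remove
threshold = np.partition(np.abs(W_merged).ravel(), k - 1)[k - 1]
mask = np.abs(W_merged) > threshold

# Single pruning step: the sparse, domain-adapted expert weight.
W_sparse = W_merged * mask

print(f"achieved sparsity: {1.0 - mask.mean():.3f}")
```

The fixed binary `mask` here is the sparsity pattern under which Partial Brain Surgeon would then solve its closed-form update for the adapter, keeping the pattern itself untouched.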