When serving a single base LLM with several different LoRA adapters simultaneously, the adapters cannot simply be merged into the base model's weights: swapping merged adapters would create overhead, and requests using different adapters could not be batched together. Instead, the LoRA computations have to be kept separate from the base LLM computations, and in a multi-device setup the LoRA adapters can be sharded in a way that aligns well with the base model's tensor-parallel execution, as proposed in S-LoRA. However, the S-LoRA sharding strategy incurs communication overhead that is small in theory but can be large in practice. In this paper, we propose to constrain certain LoRA factors to be block-diagonal, which allows for an alternative way of sharding LoRA adapters that requires no additional communication for the LoRA computations. We demonstrate in extensive experiments that our block-diagonal LoRA approach is about as parameter-efficient as standard LoRA (i.e., it achieves similar downstream performance for a similar number of parameters) and that it leads to significant end-to-end speed-ups over S-LoRA. For example, when serving on eight A100 GPUs, we observe up to 1.79x (1.23x) end-to-end speed-up with 0.87x (1.74x) the number of adapter parameters for Llama-3.1-70B, and up to 1.63x (1.3x) end-to-end speed-up with 0.86x (1.73x) the number of adapter parameters for Llama-3.1-8B.
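To make the sharding idea concrete, the following is a minimal PyTorch sketch of one possible instantiation, assuming the block-diagonal constraint is placed on the LoRA "up" factor B of a column-parallel base layer; the class and variable names (BlockDiagLoRAShard, tp_size, etc.) are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn


class BlockDiagLoRAShard(nn.Module):
    """Per-device shard of a block-diagonal LoRA adapter.

    The full adapter is Delta W = B @ A, with B constrained to be block-diagonal
    across the tensor-parallel ranks. Rank i then only needs its own blocks
    A_i (r/N x d_in) and B_i (d_out/N x r/N), and its LoRA output has exactly the
    same output shard as the column-parallel base matmul, so no extra
    communication is required for the LoRA computation.
    """

    def __init__(self, d_in: int, d_out: int, rank: int, tp_size: int):
        super().__init__()
        assert rank % tp_size == 0 and d_out % tp_size == 0
        self.A_i = nn.Parameter(torch.randn(rank // tp_size, d_in) * 0.01)
        self.B_i = nn.Parameter(torch.zeros(d_out // tp_size, rank // tp_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x is replicated on every rank (as at the input of a column-parallel layer);
        # the local LoRA delta is computed entirely from local blocks.
        return (x @ self.A_i.T) @ self.B_i.T


# Usage on one tensor-parallel rank: the base shard and the LoRA shard simply add up,
# with no all-gather or all-reduce introduced by the adapter.
tp_size, d_in, d_out, rank = 4, 1024, 4096, 32
x = torch.randn(2, d_in)                    # replicated input activations
W_i = torch.randn(d_out // tp_size, d_in)   # this rank's column-parallel base weight shard
lora_i = BlockDiagLoRAShard(d_in, d_out, rank, tp_size)
y_i = x @ W_i.T + lora_i(x)                 # local output shard of base + adapter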