LoRA enables efficient customization of LLMs and is widely used in multi-tenant and multi-task serving. However, emerging model architectures such as Mixture-of-Experts (MoE) significantly increase LoRA memory cost, so existing coupled LoRA serving designs scale poorly and are prone to tail-latency inflation. We present InfiniLoRA, a disaggregated LoRA serving system that decouples LoRA execution from base-model inference. InfiniLoRA introduces a shared LoRA Server with parallelism-aware execution, SLO-driven provisioning, and critical-path optimizations, including GPU-initiated communication and hardware-specialized LoRA kernels. Experiments show that InfiniLoRA achieves an average $3.05\times$ increase in serviceable request rate under strict latency SLOs and improves the percentage of LoRA adapters satisfying the SLO requirement by 54.0\%.