Low-Rank Adaptation (LoRA) has recently gained attention for fine-tuning foundation models: it incorporates trainable low-rank matrices and thereby reduces the number of trainable parameters. While LoRA offers numerous advantages, its applicability to real-time serving for a diverse, global user base is constrained by its inability to handle multiple task-specific adapters efficiently. This creates a performance bottleneck in scenarios that require personalized, task-specific adaptation for each incoming request. To mitigate this constraint, we introduce Fast LoRA (FLoRA), a framework in which each input example in a minibatch can be associated with its own low-rank adaptation weights, allowing heterogeneous requests to be batched efficiently. We empirically demonstrate that FLoRA retains the performance merits of LoRA, showing competitive results on the MultiPL-E code generation benchmark spanning over 8 languages and on a multilingual speech recognition task across 6 languages.
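To make the idea of per-example adapters concrete, the following is a minimal NumPy sketch of a linear layer in which every row of a minibatch uses its own low-rank update. It illustrates the general setting only and is not necessarily the formulation FLoRA uses. The function name `batched_lora_forward`, the `adapter_ids` lookup, the `scale` parameter, and all shapes are assumptions introduced for this example.

```python
import numpy as np


def batched_lora_forward(x, W, A, B, adapter_ids, scale=1.0):
    """Apply a shared weight W plus a per-example low-rank update.

    Hypothetical shapes (assumptions for this sketch):
      x:           (batch, d_in)          input activations
      W:           (d_in, d_out)          frozen base weight
      A:           (n_adapters, d_in, r)  low-rank "down" factors
      B:           (n_adapters, r, d_out) low-rank "up" factors
      adapter_ids: (batch,)               adapter index for each example
    """
    base = x @ W  # shared computation, one matmul for the whole batch

    # Gather each example's own factors via integer-array indexing.
    A_per_example = A[adapter_ids]  # (batch, d_in, r)
    B_per_example = B[adapter_ids]  # (batch, r, d_out)

    # Project down to rank r, then back up, independently per example.
    low = np.einsum("bi,bir->br", x, A_per_example)
    delta = np.einsum("br,bro->bo", low, B_per_example)
    return base + scale * delta


rng = np.random.default_rng(0)
batch, d_in, d_out, r, n_adapters = 4, 8, 6, 2, 3
x = rng.normal(size=(batch, d_in))
W = rng.normal(size=(d_in, d_out))
A = rng.normal(size=(n_adapters, d_in, r))
B = rng.normal(size=(n_adapters, r, d_out))
adapter_ids = np.array([0, 2, 1, 2])  # heterogeneous requests in one batch

y = batched_lora_forward(x, W, A, B, adapter_ids)

# Reference: process each example separately with its merged weight.
y_ref = np.stack([x[i] @ (W + A[k] @ B[k]) for i, k in enumerate(adapter_ids)])
```

Here `y` and `y_ref` agree up to floating-point error. The batched version never materializes a merged `d_in × d_out` weight per request, whereas the reference loop does.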