Large Multimodal Models (LMMs) have shown significant progress on various complex vision tasks, building on the strong linguistic and reasoning capabilities inherited from large language models (LLMs). Low-rank adaptation (LoRA) offers a promising way to integrate external knowledge into LMMs, compensating for their limitations on domain-specific tasks. However, existing LoRA model serving is excessively computationally expensive and incurs extremely high latency. In this paper, we present an end-to-end solution that empowers diverse vision tasks and enriches vision applications with LoRA LMMs. Our system, VaLoRA, enables accurate and efficient vision tasks through 1) an accuracy-aware LoRA adapter generation approach that produces LoRA adapters rich in domain-specific knowledge to meet application-specific accuracy requirements, 2) an adaptive-tiling LoRA adapter batching operator that efficiently computes concurrent heterogeneous LoRA adapters, and 3) a flexible LoRA adapter orchestration mechanism that manages application requests and LoRA adapters to achieve the lowest average response latency. We prototype VaLoRA on five popular vision tasks across three LMMs. Experimental results show that VaLoRA improves accuracy by 24-62% over the original LMMs and reduces latency by 20-89% compared to state-of-the-art LoRA model serving systems.