The growing demand for Large Language Models (LLMs) across diverse applications has prompted a paradigm shift in the design of deep learning serving systems. Deploying LLMs, especially in multi-tenant environments, presents considerable challenges due to their high computational and memory demands. We present BlockLLM, a serving system that exploits component sharing among fine-tuned LLMs to offer an efficient and flexible solution for LLM workloads. BlockLLM partitions models into finer-grained blocks, enabling the reuse of model components and independent provisioning to improve computation efficiency. BlockLLM consists of an offline block zoo for storing blocks and an online system that serves requests through chains of blocks. It offers multi-fold flexibility: (1) block chains are assembled adaptively on the fly, guided by equivalence evaluation among blocks in the zoo; (2) batch sizes are configured per block, with best-effort KV cache coordination at the individual block level; (3) speculative execution and locality-aware block placement mitigate the communication costs of dynamic block resource allocation. Our evaluation demonstrates that BlockLLM reduces memory and storage footprints and improves computation efficiency, outperforming existing serving approaches by 33.5\% in 95th-percentile latency and 20.1\% in GPU utilization.