Large Language Models (LLMs) are increasingly employed in compound workflows in which multiple LLMs and their fine-tuned variants collaborate on complex tasks. However, these systems suffer significant inefficiency from redundantly processing the context they share. We propose DroidSpeak, a framework that optimizes context sharing between fine-tuned LLMs derived from the same foundation model. DroidSpeak identifies critical layers in the KV cache and selectively recomputes only those layers, enabling effective reuse of intermediate data while maintaining high accuracy. Our approach balances computational efficiency against task fidelity, substantially reducing inference latency and alleviating throughput bottlenecks. Experiments on diverse datasets and model pairs show that DroidSpeak achieves up to 3x higher throughput and 2.6x faster prefill with negligible accuracy loss relative to full recomputation.
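The core idea of selective KV-cache reuse can be illustrated with a minimal sketch. This is a toy model only: each layer's KV cache is a plain list, and the names `recompute_layer`, `reuse_or_recompute`, and the choice of critical layers are illustrative assumptions, not DroidSpeak's actual API or layer-selection criterion.

```python
def recompute_layer(layer_idx, context_tokens):
    # Stand-in for a real prefill pass over one transformer layer
    # (hypothetical; a real system would run attention over the context).
    return [hash((layer_idx, t)) % 97 / 97.0 for t in context_tokens]

def reuse_or_recompute(shared_cache, critical_layers, context_tokens):
    """Return a per-layer KV cache: critical layers are recomputed by the
    receiving model; all other layers reuse the sender's cache as-is."""
    merged = {}
    for layer_idx, cached_kv in shared_cache.items():
        if layer_idx in critical_layers:
            merged[layer_idx] = recompute_layer(layer_idx, context_tokens)
        else:
            merged[layer_idx] = cached_kv  # reused verbatim: no prefill cost
    return merged

# Example: a 4-layer cache where only layers 0 and 2 are deemed critical.
tokens = ["shared", "context"]
sender_cache = {i: recompute_layer(i, tokens) for i in range(4)}
merged = reuse_or_recompute(sender_cache, {0, 2}, tokens)
reused = [i for i in range(4) if merged[i] is sender_cache[i]]
print(reused)  # layers whose caches were reused without recomputation
```

The savings come from skipping prefill for the reused layers; the accuracy/latency trade-off then hinges on how the critical layers are chosen.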