As the use of large language models (LLMs) expands rapidly, so does the range of knowledge needed to supplement various LLM queries. Thus, enabling flexible and efficient injection of new knowledge into LLM inference is critical. Three high-level options exist: (i) embedding the knowledge in the LLM's weights (i.e., fine-tuning), (ii) including the knowledge as part of the LLM's text input (i.e., in-context learning), or (iii) injecting the KV caches of the new knowledge into the LLM during prefill. This paper argues that, although fine-tuning and in-context learning are popular, using KV caches as the medium of knowledge could simultaneously enable more modular management of knowledge injection and more efficient LLM serving, with lower cost and faster response. To realize these benefits, we envision a Knowledge Delivery Network (KDN), a new system component in LLM services that dynamically optimizes the storage, transfer, and composition of KV caches across LLM engines and other compute and storage resources. We believe that, just as content delivery networks (CDNs), such as Akamai, enabled the success of the Internet ecosystem through efficient data delivery, KDNs will be critical to the success of LLM applications through efficient knowledge delivery. We have open-sourced a KDN prototype at https://github.com/LMCache/LMCache.
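To make option (iii) concrete, the sketch below shows the kind of interface a KDN could expose: precomputed KV caches, keyed by a content hash of the text chunk they encode, are published once and looked up by any serving engine before prefill. This is an illustrative minimal sketch only; the class and method names (`KDNStore`, `store`, `lookup`) are hypothetical and are not the LMCache API, and the cache payloads are placeholder bytes rather than real serialized KV tensors.

```python
import hashlib


class KDNStore:
    """Illustrative in-memory knowledge-delivery store (hypothetical, not the LMCache API).

    A real KDN would hold serialized KV-cache tensors produced by an LLM
    engine's prefill pass; here each cache is a placeholder blob keyed by
    a content hash of the text chunk it encodes, so identical chunks map
    to the same cache entry regardless of which engine computed them.
    """

    def __init__(self):
        self._caches = {}  # content-hash key -> KV-cache blob

    @staticmethod
    def _key(text: str) -> str:
        # Content-addressed key: the same chunk always hashes to the same entry.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def store(self, text: str, kv_blob: bytes) -> str:
        """Publish the KV cache computed for `text`; return its key."""
        k = self._key(text)
        self._caches[k] = kv_blob
        return k

    def lookup(self, text: str):
        """Return the cached KV blob for `text`, or None on a miss."""
        return self._caches.get(self._key(text))


# Usage: an engine prefills a knowledge chunk once and publishes the cache;
# later queries over the same chunk fetch the cache instead of re-prefilling.
store = KDNStore()
doc = "Chunk of reference text the model should condition on."
store.store(doc, b"serialized-kv-tensors")  # placeholder payload
assert store.lookup(doc) == b"serialized-kv-tensors"  # hit: prefill skipped
assert store.lookup("unseen text") is None            # miss: prefill needed
```

Content addressing is what makes the knowledge modular: caches can be stored, transferred, and composed independently of any particular serving engine, which is the property the KDN vision above relies on.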