Multi-LLM systems harness the complementary strengths of diverse Large Language Models, achieving performance and efficiency gains unattainable by a single model. In existing designs, LLMs communicate through text, forcing internal representations to be transformed into output token sequences. This process both loses rich semantic information and incurs token-by-token generation latency. Motivated by these limitations, we ask: Can LLMs communicate beyond text? Oracle experiments show that enriching the KV-Cache semantics can improve response quality without increasing the cache size, supporting the KV-Cache as an effective medium for inter-model communication. Thus, we propose Cache-to-Cache (C2C), a new paradigm for direct semantic communication between LLMs. C2C uses a neural network to project and fuse the source model's KV-Cache with that of the target model to enable direct semantic transfer. A learnable gating mechanism selects the target layers that benefit from cache communication. Compared with text communication, C2C leverages the deep, specialized semantics of both models while avoiding explicit intermediate text generation. Experiments show that C2C achieves 8.5-10.5% higher average accuracy than the individual models. It further outperforms the text communication paradigm by approximately 3.0-5.0%, while delivering an average 2.0x speedup in latency. Our code is available at https://github.com/thu-nics/C2C.
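Below is a minimal, illustrative PyTorch sketch of the fusion idea described above, not the released implementation: a per-layer projection maps the source model's KV-Cache into the target model's KV space, and a learnable per-layer gate controls how much of the projected cache is blended into the target's own cache. All names (C2CFuser, src_kv, tgt_kv, d_src, d_tgt), the flattened cache layout, and the exact fusion form are assumptions made for clarity; see the repository above for the actual architecture.

```python
# Illustrative sketch only; module names, shapes, and the fusion rule are assumptions.
import torch
import torch.nn as nn


class C2CFuser(nn.Module):
    """Project a source model's KV-Cache into a target model's KV space and fuse them."""

    def __init__(self, num_layers: int, d_src: int, d_tgt: int):
        super().__init__()
        # One projection per target layer: source KV dimension -> target KV dimension.
        self.proj_k = nn.ModuleList([nn.Linear(d_src, d_tgt) for _ in range(num_layers)])
        self.proj_v = nn.ModuleList([nn.Linear(d_src, d_tgt) for _ in range(num_layers)])
        # Learnable scalar gate per layer; sigmoid keeps it in (0, 1),
        # so each layer learns how much it benefits from cache communication.
        self.gate = nn.Parameter(torch.zeros(num_layers))

    def forward(self, src_kv, tgt_kv):
        """Fuse the source KV-Cache into the target KV-Cache.

        src_kv, tgt_kv: lists of (key, value) tensors, one pair per layer, shaped
        [batch, seq_len, d_src] and [batch, seq_len, d_tgt] with attention heads
        flattened into the last dimension. Both caches are assumed to cover the
        same (aligned) token positions.
        """
        fused = []
        for layer, ((k_s, v_s), (k_t, v_t)) in enumerate(zip(src_kv, tgt_kv)):
            g = torch.sigmoid(self.gate[layer])  # per-layer blend weight
            k_f = (1 - g) * k_t + g * self.proj_k[layer](k_s)
            v_f = (1 - g) * v_t + g * self.proj_v[layer](v_s)
            fused.append((k_f, v_f))
        return fused
```

In this sketch, the fused cache would replace the target model's prefill cache before decoding, so the target never needs to read an intermediate text message from the source model.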