Multi-LLM systems harness the complementary strengths of diverse Large Language Models, achieving performance and efficiency gains unattainable by a single model. In existing designs, LLMs communicate through text, forcing internal representations to be transformed into output token sequences. This process both loses rich semantic information and incurs token-by-token generation latency. Motivated by these limitations, we ask: Can LLMs communicate beyond text? Oracle experiments show that enriching the KV-Cache semantics can improve response quality without increasing the cache size, supporting the KV-Cache as an effective medium for inter-model communication. Thus, we propose Cache-to-Cache (C2C), a new paradigm for direct semantic communication between LLMs. C2C uses a neural network to project and fuse the source model's KV-Cache with that of the target model to enable direct semantic transfer. A learnable gating mechanism selects the target layers that benefit from cache communication. Compared with text communication, C2C leverages the deep, specialized semantics of both models while avoiding explicit intermediate text generation. Experiments show that C2C achieves 8.5-10.5% higher average accuracy than the individual models. It further outperforms the text communication paradigm by approximately 3.0-5.0%, while delivering an average 2.0x speedup in latency. Our code is available at https://github.com/thu-nics/C2C.
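Below is a minimal, illustrative PyTorch sketch of the fusion idea described above, not the released implementation: a per-layer projection maps the source model's KV-Cache into the target model's KV space, and a learnable per-layer gate controls how much of the projected cache is blended into the target's own cache. All names (C2CFuser, src_kv, tgt_kv, d_src, d_tgt), the flattened cache layout, and the exact fusion form are assumptions made for clarity; see the repository above for the actual architecture.

```python
# Illustrative sketch only; module names, shapes, and the fusion rule are assumptions.
import torch
import torch.nn as nn


class C2CFuser(nn.Module):
    """Project a source model's KV-Cache into a target model's KV space and fuse them."""

    def __init__(self, num_layers: int, d_src: int, d_tgt: int):
        super().__init__()
        # One projection per target layer: source KV dimension -> target KV dimension.
        self.proj_k = nn.ModuleList([nn.Linear(d_src, d_tgt) for _ in range(num_layers)])
        self.proj_v = nn.ModuleList([nn.Linear(d_src, d_tgt) for _ in range(num_layers)])
        # Learnable scalar gate per layer; sigmoid keeps it in (0, 1),
        # so each layer learns how much it benefits from cache communication.
        self.gate = nn.Parameter(torch.zeros(num_layers))

    def forward(self, src_kv, tgt_kv):
        """Fuse the source KV-Cache into the target KV-Cache.

        src_kv, tgt_kv: lists of (key, value) tensors, one pair per layer, shaped
        [batch, seq_len, d_src] and [batch, seq_len, d_tgt] with attention heads
        flattened into the last dimension. Both caches are assumed to cover the
        same (aligned) token positions.
        """
        fused = []
        for layer, ((k_s, v_s), (k_t, v_t)) in enumerate(zip(src_kv, tgt_kv)):
            g = torch.sigmoid(self.gate[layer])  # per-layer blend weight
            k_f = (1 - g) * k_t + g * self.proj_k[layer](k_s)
            v_f = (1 - g) * v_t + g * self.proj_v[layer](v_s)
            fused.append((k_f, v_f))
        return fused
```

In this sketch, the fused cache would replace the target model's prefill cache before decoding, so the target never needs to read an intermediate text message from the source model.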