Techniques enabling large language models (LLMs) to "think more" by generating and attending to intermediate reasoning steps have shown promise in solving complex problems. However, the standard approaches generate sequences of discrete tokens immediately before responding, and so they can incur significant latency costs and be challenging to optimize. In this work, we demonstrate that a frozen LLM can be augmented with an offline coprocessor that operates on the model's key-value (kv) cache. This coprocessor augments the cache with a set of latent embeddings designed to improve the fidelity of subsequent decoding. We train this coprocessor using the language modeling loss from the decoder on standard pretraining data, while keeping the decoder itself frozen. This approach enables the model to learn, in an end-to-end differentiable fashion, how to distill additional computation into its kv-cache. Because the decoder remains unchanged, the coprocessor can operate offline and asynchronously, and the language model can function normally if the coprocessor is unavailable or if a given cache is deemed not to require extra computation. We show experimentally that when a cache is augmented, the decoder achieves lower perplexity on numerous subsequent tokens. Furthermore, even without any task-specific training, our experiments demonstrate that cache augmentation consistently reduces perplexity and improves performance across a range of reasoning-intensive tasks.
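The training setup described above, where gradients flow through a frozen decoder into a trainable coprocessor, can be sketched in a toy form. This is a minimal illustration only: the names `decoder`, `coprocessor`, and the additive way the latent embedding conditions decoding are assumptions for the sketch, not the paper's actual architecture, and simple linear layers stand in for the LLM and the kv-cache.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 8  # toy embedding dimension

# Stand-in for the frozen LLM decoder: its parameters receive no updates.
decoder = nn.Linear(d, d)
for p in decoder.parameters():
    p.requires_grad_(False)

# Trainable coprocessor: maps a (stand-in) kv-cache summary to a latent
# augmentation embedding.
coprocessor = nn.Linear(d, d)

cache = torch.randn(1, d)          # stand-in for the model's kv-cache
latent = coprocessor(cache)        # latent embedding that augments the cache
logits = decoder(cache + latent)   # decoding conditioned on the augmented cache
loss = logits.pow(2).mean()        # stand-in for the language-modeling loss
loss.backward()

# Gradients flow end-to-end through the frozen decoder into the coprocessor,
# while the decoder itself stays unchanged.
assert coprocessor.weight.grad is not None
assert decoder.weight.grad is None
```

Because the decoder's weights are untouched, the same decoder can run with or without the coprocessor's latent embeddings, mirroring the paper's point that the language model functions normally when the coprocessor is unavailable.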