We introduce KV-Fold, a simple, training-free long-context inference protocol that treats the key-value (KV) cache as the accumulator in a left fold over sequence chunks. At each step, the model processes the next chunk conditioned on the accumulated cache, appends the newly produced keys and values, and passes the enlarged cache forward; the same one-step update is applied repeatedly, analogous to foldl in functional programming. Building on the KV cache concatenation primitive introduced for latent multi-agent communication, we repurpose it as a chunk-to-chunk recurrence for long-context inference. When processing chunk t, the model attends to the KV cache carried from earlier chunks as a prefix, reusing its internal state across segments without modifying or retraining the model. Despite its simplicity, the induced recurrence is stable: per-step drift rises briefly and then saturates into a flat plateau that persists across deep chains. This plateau is insensitive to a 10,000x change in numerical precision, robust across chunk sizes, and consistent across model families. At the task level, KV-Fold preserves exact information over long distances. On a needle-in-a-haystack benchmark, it achieves 100% exact-match retrieval across 152 trials spanning contexts from 16K to 128K tokens and chain depths up to 511 on Llama-3.1-8B, while remaining within the memory limits of a single 40GB GPU. Compared to streaming methods, which trade fidelity for bounded memory, KV-Fold maintains long-range retrieval while operating as a sequence of tractable forward passes. Overall, our results show that frozen pretrained transformers already support a stable form of KV-cache recurrence, providing a practical route to long-context inference without architectural changes or training.
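The left-fold framing above can be illustrated with a small toy, offered as a sketch under stated assumptions rather than the authors' implementation. It uses a single attention head in pure Python, and the names `attend`, `fold_step`, and `kv_fold`, as well as the identity-style query/key/value projections, are hypothetical choices made for illustration. The accumulator passed through `functools.reduce` is the KV cache: each step attends over the carried cache as a prefix, appends the new keys and values, and hands the enlarged cache forward.

```python
from functools import reduce
import math


def attend(query, keys, values):
    """Softmax attention of one query vector over lists of key/value vectors."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


def fold_step(state, chunk):
    """One fold step: process a chunk against the carried cache, then
    return the enlarged cache (the accumulator) for the next chunk."""
    cache_k, cache_v, outputs = state
    for token in chunk:
        # Hypothetical toy projections: query = key = token, value = 2 * token.
        cache_k = cache_k + [token]
        cache_v = cache_v + [[2.0 * x for x in token]]
        # Causal attention: the token sees the carried prefix plus the
        # tokens of its own chunk processed so far (including itself).
        outputs = outputs + [attend(token, cache_k, cache_v)]
    return cache_k, cache_v, outputs


def kv_fold(tokens, chunk_size):
    """Left fold over fixed-size chunks with the KV cache as accumulator."""
    chunks = [tokens[i:i + chunk_size]
              for i in range(0, len(tokens), chunk_size)]
    return reduce(fold_step, chunks, ([], [], []))


if __name__ == "__main__":
    toy_tokens = [[0.1, 0.2], [0.3, -0.1], [0.5, 0.4], [-0.2, 0.7], [0.9, 0.0]]
    keys, values, outputs = kv_fold(toy_tokens, chunk_size=2)
    print(len(keys), "cached keys after", 3, "chunks")
```

In this single-layer toy, folding over chunks of two tokens produces the same outputs as processing all tokens as one chunk, because every token ends up attending over an identical cache. The toy does not model the per-step drift that the abstract reports for real multi-layer models; it only shows the shape of the recurrence.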