Large Language Models can reason over long contexts, yet prefilling millions of tokens is wasteful because much of the content remains static across queries. Cartridges address this by distilling document collections into reusable key-value (KV) caches that eliminate prefilling while preserving accuracy. A critical limitation of this approach is that cartridges are monolithic and non-compositional: encoding an entire collection into a single KV block does not scale, and naively mixing cartridges trained in isolation collapses performance to near chance. We introduce Cartridges at Scale (CAS), a training framework for scalable multi-cartridge learning with dynamic distractor mixing and a memory-efficient budget manager that rotates hundreds of per-document cartridges between GPU and persistent storage. Our approach scales to collections exceeding a million tokens, improving over a monolithic cartridge by 10-31 points at comparable token budgets. Oracle cartridge accuracy falls within 2-6 points of full in-context learning even at high compression. When paired with retrieval for cartridge selection, CAS matches or exceeds conventional RAG accuracy while consuming 3-4x fewer prompt tokens.