The context window of a large language model is not memory. It is L1 cache: a small, fast, expensive resource that the field treats as the entire memory system. There is no L2, no virtual memory, no paging. Every tool definition, every system prompt, and every stale tool result occupies context for the lifetime of the session. The result is measurable: across 857 production sessions and 4.45 million effective input tokens, 21.8% is structural waste. We present Pichay, a demand paging system for LLM context windows. Implemented as a transparent proxy between client and inference API, Pichay interposes on the message stream to evict stale content, detect page faults when the model re-requests evicted material, and pin working-set pages identified by fault history. In offline replay across 1.4 million simulated evictions, the fault rate is 0.0254%. In live production deployment over 681 turns, the system reduces context consumption by up to 93% (5,038 KB to 339 KB); under extreme sustained pressure, the system remains operational but exhibits the expected thrashing pathology, with repeated fault-in of evicted content. The key observation is that the problems the field faces (context limits, attention degradation, cost scaling, lost state across sessions) are virtual memory problems wearing different clothes. The solutions exist: working set theory (Denning, 1968), demand paging, fault-driven replacement policies, and memory hierarchies with multiple eviction-managed levels. We describe the architecture of a full memory hierarchy for LLM systems (L1 through persistent storage), report on the first three levels deployed in production use (L1 eviction, L2 fault-driven pinning, L3 model-initiated conversation compaction), and identify cross-session memory as the remaining frontier.
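The eviction, fault-detection, and pinning cycle described above can be sketched in a few dozen lines. This is a minimal illustration of the general technique, not Pichay's implementation; all names (`ContextPager`, the fault threshold, the LRU-style eviction order) are assumptions made for the sketch.

```python
# Sketch of demand paging for a context window: evict stale pages past a
# budget, detect a fault when evicted content is re-requested, and pin
# pages whose fault history marks them as working-set members.
from collections import Counter


class ContextPager:
    def __init__(self, capacity, pin_after_faults=2):
        self.capacity = capacity      # max resident pages (the L1 budget)
        self.resident = {}            # page_id -> content, oldest-first order
        self.evicted = {}             # backing store for evicted pages (L2)
        self.faults = Counter()       # fault history per page
        self.pinned = set()           # working-set pages exempt from eviction
        self.pin_after = pin_after_faults

    def insert(self, page_id, content):
        self.resident[page_id] = content
        self._evict_if_needed()

    def access(self, page_id):
        """Return page content, faulting it back in if it was evicted."""
        if page_id in self.resident:
            return self.resident[page_id]
        if page_id in self.evicted:
            # Page fault: the model re-requested evicted material.
            self.faults[page_id] += 1
            if self.faults[page_id] >= self.pin_after:
                self.pinned.add(page_id)  # fault-driven pinning
            self.resident[page_id] = self.evicted.pop(page_id)
            self._evict_if_needed()
            return self.resident[page_id]
        raise KeyError(page_id)

    def _evict_if_needed(self):
        # Evict the stalest unpinned pages until back under budget.
        for page_id in list(self.resident):
            if len(self.resident) <= self.capacity:
                break
            if page_id in self.pinned:
                continue
            self.evicted[page_id] = self.resident.pop(page_id)
```

A transparent proxy would apply this policy per message: tool results and stale turns become pages, and the fault path corresponds to re-injecting evicted content when the model asks for it again.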