Streaming Knowledge Compilation: Proactive Materiality-Scored Pinning for Time-Evolving LLM Wikis

LLM wiki systems compile knowledge into pre-filled KV caches for efficient inference, but they assume a static corpus, an assumption that fails whenever the underlying information landscape evolves. We formalize Streaming Knowledge Compilation: given a document stream, a fixed token budget, and future queries that are unknown at ingestion time, maintain a compiled wiki that minimizes cumulative regret against an offline oracle with perfect foresight. The enabling insight is a materiality signal $\phi_t(k,n)\in[0,1]$ that scores document importance for entity $k$ at time $t$ and acts as a query-relevance surrogate, so that documents can be pinned proactively before queries arrive. We prove an $O(\sqrt{T\log K})$ regret bound in which the expected signal prediction error $\varepsilon=\mathbb{E}[|\phi_t-\hat{\phi}_t|]$ is the only domain-specific quantity. We instantiate the framework in two domains. In finance, $\phi_t$ is abnormal stock volatility predicted by a frozen Llama 3.1 8B classification head (AUROC = 0.728 on 76K articles with a strict temporal split; $1.49\times$ higher realized forward volatility for predicted-material articles). In Wikipedia, $\phi_t$ is the Abnormal Edit Ratio (AER), a cross-sectionally normalized edit velocity, which shows that the same algorithm generalizes beyond the finance domain. End-to-end QA evaluation on 173 matched pairs (finance) and 119 (Wikipedia) reveals a pervasive LLM-as-judge confound on post-training knowledge, establishing that regret analysis, not absolute QA scores, is the reliable evaluation metric for compiled knowledge systems. Finance cumulative regret converges to -20.0 (-0.12/step) and Wikipedia cumulative regret to +16.0 (+0.13/step). The positive sign confirms that Wikipedia edit content is genuinely post-training, since richer context consistently improves scores (No Wiki 3.80 vs. Oracle 4.74), and eliminates this confound. The $O(\sqrt{T\log K})$ guarantee applies to any domain where knowledge gaps can be predicted from streaming signals.
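
As a rough illustration of the two ideas above (pinning by materiality under a token budget, and tracking cumulative regret against an oracle), the following minimal Python sketch may help. It is not the paper's implementation: the names `Doc`, `pin_under_budget`, and `cumulative_regret` are hypothetical, the greedy selection rule is an assumption, and regret is taken per step as oracle score minus policy score, an assumed sign convention chosen so that a positive value means the oracle scored higher.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    entity: str
    tokens: int          # document length in tokens
    materiality: float   # predicted materiality score in [0, 1]


def pin_under_budget(docs, budget):
    """Greedily pin the highest-materiality documents that fit the token budget.

    Assumption: a simple greedy rule; the actual pinning policy may differ.
    """
    pinned, used = [], 0
    for doc in sorted(docs, key=lambda d: d.materiality, reverse=True):
        if used + doc.tokens <= budget:
            pinned.append(doc)
            used += doc.tokens
    return pinned


def cumulative_regret(oracle_scores, policy_scores):
    """Running sum of (oracle - policy) per step (assumed sign convention)."""
    total, curve = 0.0, []
    for oracle, policy in zip(oracle_scores, policy_scores):
        total += oracle - policy
        curve.append(total)
    return curve


if __name__ == "__main__":
    # Toy stream: three documents competing for a 100-token budget.
    stream = [
        Doc("a", "X", tokens=60, materiality=0.9),
        Doc("b", "X", tokens=50, materiality=0.8),
        Doc("c", "Y", tokens=40, materiality=0.5),
    ]
    print([d.doc_id for d in pin_under_budget(stream, budget=100)])
    print(cumulative_regret([5, 4, 3], [4, 4, 5]))
```

In this toy run, document `b` is skipped because it no longer fits after `a` is pinned, while the lower-scored `c` still fits. A score-per-token ranking would be a natural alternative to this greedy rule.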
