Residual-Entropy Accounting for Routed Atom-Budgeted Learned Indexes

We study exact predecessor and rank search in a routed, atom-budgeted, certified-repair learned-index architecture. An ordered directory routes each query to a contiguous interval, a counted local predictor returns a certified rank window, and exact repair resolves the remaining uncertainty by comparisons. The results are scoped to this architecture: they make no claim about arbitrary learned-index designs such as unconstrained RMI dispatch, hash routing, neural routing, or exact-payload indexes without additional accounting. The main parameter is the conditional residual answer entropy, that is, the entropy of the exact answer once the leaf, predictor output, certificate, and charged pre-repair information have been observed. We prove a two-sided accounting theorem showing that this functional sets the query-time scale under the stated architecture and local predictor-atom budget. Directory space, sorted-array storage, and transcript-indexed repair-program space are treated as separate system costs, so the theorem is neither a byte-level space lower bound nor a compact implementation recipe. We also give a rank-spread specialization in which the radius term log(1 + Δ) is valid only when many residual ranks remain likely after the predictor transcript is known. For counted piecewise-linear segments, we make the profile term non-oracular, derive a shadow-price allocation rule, compute finite-instance RGapM and GapM values on real SOSD and Zenodo samples, and report benchmarks against PGM-index, RadixSpline, and binary search. The benchmarks expose overheads and bottlenecks; they do not claim a speed advantage for the shadow prototype.
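
The route, predict, repair pipeline and the residual-entropy quantity can be illustrated with a minimal Python sketch. It is not the implementation evaluated in the paper. The names `build`, `lookup`, `residual_entropy`, and `LEAF_SIZE` are hypothetical, as are the endpoint-fitted linear predictor, the synthetic uniform keys, and the choice of (leaf, prediction, radius) as the transcript. The spread is taken here to be Δ = 2·radius + 1, the number of candidate ranks in the widest window minus one, which may differ from the paper's definition of Δ.

```python
import bisect
import math
import random
from collections import Counter, defaultdict

LEAF_SIZE = 256  # hypothetical leaf capacity, chosen only for this demo


def build(keys, leaf_size=LEAF_SIZE):
    """Split sorted distinct keys into contiguous leaves with one linear atom each."""
    leaves = []
    for lo in range(0, len(keys), leaf_size):
        hi = min(lo + leaf_size, len(keys))
        x0, x1 = keys[lo], keys[hi - 1]
        slope = (hi - lo - 1) / (x1 - x0) if x1 > x0 else 0.0
        # Certificate: maximum prediction error over the keys stored in this leaf.
        radius = max(abs(round(slope * (keys[i] - x0)) - (i - lo))
                     for i in range(lo, hi))
        leaves.append((lo, hi, x0, slope, radius))
    directory = [leaf[2] for leaf in leaves]  # first key of every leaf
    return directory, leaves


def lookup(keys, directory, leaves, q):
    """Return (rank, transcript), where rank is the number of keys <= q."""
    # Route: the ordered directory selects a contiguous leaf interval.
    j = max(bisect.bisect_right(directory, q) - 1, 0)
    lo, hi, x0, slope, radius = leaves[j]
    # Predict: a monotone linear atom, clamped to the leaf.
    pred = min(max(round(slope * (q - x0)), 0), hi - lo - 1)
    # Certified window: the true rank lies in [w_lo, w_hi].
    w_lo = max(lo, lo + pred - radius)
    w_hi = min(hi, lo + pred + radius + 1)
    # Repair: exact comparisons restricted to the certified window.
    rank = bisect.bisect_right(keys, q, w_lo, w_hi)
    return rank, (j, pred, radius)


def residual_entropy(samples):
    """Empirical H(rank | transcript) in bits from (rank, transcript) pairs."""
    groups = defaultdict(Counter)
    for rank, transcript in samples:
        groups[transcript][rank] += 1
    total = sum(sum(counts.values()) for counts in groups.values())
    h = 0.0
    for counts in groups.values():
        m = sum(counts.values())
        for c in counts.values():
            h -= (c / total) * math.log2(c / m)
    return h


rng = random.Random(0)
keys = sorted(rng.sample(range(1_000_000), 4096))  # synthetic data, assumption
queries = [rng.randrange(keys[0], keys[-1] + 1) for _ in range(20_000)]

directory, leaves = build(keys)
samples = [lookup(keys, directory, leaves, q) for q in queries]

h = residual_entropy(samples)
delta = max(leaf[4] for leaf in leaves)  # largest certified radius
print(f"empirical residual entropy: {h:.3f} bits")
print(f"log2(1 + spread) ceiling:   {math.log2(2 * delta + 2):.3f} bits")
```

In this sketch the empirical residual entropy can never exceed the logarithm of the window size, but it can fall well below it when the transcript already concentrates the answer on a few ranks. That is the situation in which a radius term overstates the repair cost.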
