Learned index structures aim to accelerate queries by training machine learning models to approximate the rank function associated with a database attribute. While effective in practice, their theoretical limitations are not fully understood. We present a general framework for proving lower bounds on the query time of learned indexes, expressed in terms of their space overhead and parameterized by the model class used for approximation. Our formulation captures a broad family of learned indexes, including most existing designs, as piecewise model-based predictors. We lower bound query time in two steps. First, we use probabilistic tools to control the effect of sampling when the database attribute is drawn from a probability distribution. Second, we analyze the approximation-theoretic problem of how to optimally represent a cumulative distribution function with approximators from a given model class. Within this framework, we derive lower bounds under a range of modeling and distributional assumptions, paying particular attention to piecewise linear and piecewise constant model classes, which are common in practical implementations. Our analysis shows how tools from approximation theory, such as quantization and Kolmogorov widths, can be leveraged to formalize the space-time tradeoffs inherent to learned index structures. The resulting bounds illuminate core limitations of these methods.