The use of machine learning to perform database operations, such as indexing, cardinality estimation, and sorting, has been shown to provide substantial performance benefits. However, when datasets change and the data distribution shifts, empirical results also show performance degradation for learned models, possibly to worse than non-learned alternatives. This, together with a lack of theoretical understanding of learned methods, undermines their practical applicability, since there are no guarantees on how well the models will perform after deployment. In this paper, we present the first known theoretical characterization of the performance of learned models on dynamic datasets for the aforementioned operations. Our results reveal novel theoretical characteristics achievable by learned models and provide performance bounds that characterize their advantages over non-learned methods, showing why and when learned models can outperform the alternatives. Our analysis develops the distribution learnability framework and novel theoretical tools that lay the foundation for future analysis of learned database operations.