Machine learning models have demonstrated substantial performance enhancements over non-learned alternatives in various fundamental data management operations, including indexing (locating items in an array), cardinality estimation (estimating the number of matching records in a database), and range-sum estimation (estimating aggregate attribute values for query-matched records). However, real-world systems frequently favor less efficient non-learned methods because they offer (worst-case) error guarantees, a property that learned approaches often lack. The primary objective of such guarantees is system reliability: the chosen approach must consistently deliver the desired level of accuracy across all databases. In this paper, we embark on the first theoretical study of such guarantees for learned methods, presenting the necessary conditions for these guarantees to hold when using machine learning to perform indexing, cardinality estimation, and range-sum estimation. Specifically, we present the first known lower bounds on the model size required to achieve a desired accuracy for these three key database operations. Our results bound the required model size for given average and worst-case errors, serving as the first theoretical guidelines governing how model size must grow with data size to guarantee an accuracy level. More broadly, our established guarantees pave the way for the broader adoption and integration of learned models into real-world systems.
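To make the indexing setting concrete, the following is a minimal, hypothetical sketch (not the paper's construction) of a learned index with a worst-case error guarantee: a linear model predicts each key's position in a sorted array, and the maximum prediction error observed over the data bounds the local search window at lookup time. All function names are illustrative.

```python
import bisect

def build_learned_index(keys):
    """Fit position ~ slope * key + intercept by least squares over the
    sorted keys, and record the worst-case prediction error eps."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    var = sum((k - mean_k) ** 2 for k in keys)
    cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
    slope = cov / var if var else 0.0
    intercept = mean_p - slope * mean_k
    # Worst-case (maximum absolute) position error on this dataset:
    # every true position lies within +/- eps of the model's prediction.
    eps = max(abs(i - (slope * k + intercept)) for i, k in enumerate(keys))
    return slope, intercept, eps

def lookup(keys, model, key):
    """Locate key: predict its position, then binary-search only
    within the window guaranteed by the recorded error bound."""
    slope, intercept, eps = model
    pred = slope * key + intercept
    lo = max(0, int(pred - eps))
    hi = min(len(keys), int(pred + eps) + 2)
    i = bisect.bisect_left(keys, key, lo, hi)
    return i if i < len(keys) and keys[i] == key else -1

keys = sorted([3, 8, 15, 16, 23, 42, 57, 91])
model = build_learned_index(keys)
print(lookup(keys, model, 23))  # prints 4, the index of 23
```

The tension the abstract describes shows up here directly: a tiny model (two floats plus one error bound) may have a large `eps` on adversarial data, and shrinking the guaranteed window requires a larger model, which is exactly the model-size/accuracy trade-off the lower bounds formalize.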