Cardinality estimation (CardEst) is essential for optimizing query execution plans. Recent ML-based CardEst methods achieve high accuracy but face deployment challenges due to high preparation costs and lack of transferability across databases. In this paper, we propose PRICE, a PRetrained multI-table CardEst model, which addresses these limitations. PRICE takes low-level but transferable features w.r.t. data distributions and query information and elegantly applies self-attention models to learn meta-knowledge to compute cardinality in any database. It is generally applicable to any unseen new database to attain high estimation accuracy, while its preparation cost is as little as the basic one-dimensional histogram-based CardEst methods. Moreover, PRICE can be finetuned to further enhance its performance on any specific database. We pretrained PRICE using 30 diverse datasets, completing the process in about 5 hours with a resulting model size of only about 40MB. Evaluations show that PRICE consistently outperforms existing methods, achieving the highest estimation accuracy on several unseen databases and generating faster execution plans with lower overhead. After finetuning with a small volume of databasespecific queries, PRICE could even find plans very close to the optimal ones. Meanwhile, PRICE is generally applicable to different settings such as data updates, data scaling, and query workload shifts. We have made all of our data and codes publicly available at https://github.com/StCarmen/PRICE.
翻译:基数估计(CardEst)对于优化查询执行计划至关重要。近期基于机器学习的CardEst方法虽能达到较高精度,但由于准备成本高昂且缺乏跨数据库的可迁移性,面临实际部署挑战。本文提出PRICE,一种预训练的多表基数估计模型,以应对这些局限。PRICE采用与数据分布及查询信息相关的低层次可迁移特征,并巧妙应用自注意力模型学习元知识,从而计算任意数据库的基数。该模型可普遍适用于任何未见过的数据库,并获得高估计精度,而其准备成本仅与基于一维直方图的基础CardEst方法相当。此外,PRICE可通过微调进一步提升其在特定数据库上的性能。我们使用30个多样化数据集对PRICE进行预训练,整个过程耗时约5小时,所得模型大小仅约40MB。评估结果表明,PRICE在多个未见数据库上持续优于现有方法,实现了最高的估计精度,并能以更低开销生成更快的执行计划。经过少量数据库特定查询微调后,PRICE甚至能生成接近最优的执行计划。同时,PRICE可普遍适用于数据更新、数据缩放及查询负载变化等不同场景。我们已将全部数据与代码公开于https://github.com/StCarmen/PRICE。