Cardinality estimation (CardEst) remains a challenging problem for DBMSs. In recent years, ML-based cardinality estimators have outperformed traditional methods. However, these solutions generalize poorly to new data or query distributions, cannot handle complex queries, and incur substantial data preparation overhead, which has prevented their wide adoption in real-world DBMSs. Recent efforts address some but not all of these issues. We observe that recently emerging Large Language Models (LLMs) have shown remarkable generalizability to unseen tasks, the capability to understand complex programs, and the ability to be fine-tuned data-efficiently. In light of this, we propose to leverage LLMs to mitigate the above issues. Specifically, we carefully craft prompts, then fine-tune LLMs for the CardEst task and apply self-correction during inference. We then extensively evaluate LLMs' in-distribution and out-of-distribution generalizability, their feasibility for supporting complex queries, and their training data efficiency when fine-tuning LLMs on pre-training datasets. The results suggest that LLMs outperform the state of the art in almost all settings, indicating their potential for the CardEst task. We further measure end-to-end query execution time in a DBMS using the cardinalities estimated by LLMs in some practical settings; the results suggest that the benefits LLMs bring to CardEst can outweigh their inference overhead.