Query optimization is a pivotal part of every database management system (DBMS) because it determines the efficiency of query execution. Numerous works have applied Machine Learning (ML) techniques to cost modeling, cardinality estimation, and end-to-end learned optimizers, but few have proven practical because of long training times, lack of interpretability, and integration cost. A recent study provides a practical method that optimizes queries by recommending per-query hints, but it suffers from two inherent problems. First, it follows the regression framework to predict the absolute latency of each query plan, which is very challenging because the latencies of the candidate plans for a single query may span multiple orders of magnitude. Second, it requires training a separate model for each dataset, which restricts the application of the trained models in practice. In this paper, we propose COOOL, which predicts Cost Orders of query plans to cOOperate with DBMS by Learning-To-Rank. Instead of estimating absolute costs, COOOL uses ranking-based approaches to compute relative ranking scores of the costs of query plans. We show that COOOL is theoretically valid for distinguishing query plans with different latencies. We implement COOOL on PostgreSQL, and extensive experiments on the join-order-benchmark and TPC-H data demonstrate that COOOL outperforms PostgreSQL and state-of-the-art methods on single-dataset tasks as well as with a unified model for multiple-dataset tasks. Our experiments also shed some light, from the representation learning perspective, on why COOOL outperforms regression approaches, which may guide future research.
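To make the contrast between regression and ranking concrete, the following is a minimal sketch of one common learning-to-rank objective, a pairwise logistic loss over the candidate plans of a single query. It is not COOOL's implementation: the abstract does not say which ranking losses are used, and the function names `pairwise_ranking_loss` and `pick_plan`, the convention that a higher score means a faster plan, and the sample latencies are all assumptions made for illustration.

```python
import math
from itertools import combinations


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_ranking_loss(scores, latencies):
    """Mean pairwise logistic loss over the candidate plans of one query.

    Assumed convention: a plan with lower latency should get a higher score.
    Only the ORDER of the latencies is used, never their magnitude.
    """
    total, pairs = 0.0, 0
    for i, j in combinations(range(len(scores)), 2):
        if latencies[i] == latencies[j]:
            continue  # ties carry no ordering information
        # Orient the pair so that `fast` is the lower-latency plan.
        fast, slow = (i, j) if latencies[i] < latencies[j] else (j, i)
        # Penalize pairs where the faster plan is not scored higher.
        total += softplus(scores[slow] - scores[fast])
        pairs += 1
    return total / pairs if pairs else 0.0


def pick_plan(scores):
    """Index of the plan the model would recommend (highest score)."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical latencies (seconds) for three plans of one query,
# spanning several orders of magnitude.
latencies = [0.05, 4.0, 900.0]
well_ordered = [2.0, 0.5, -3.0]   # scores that agree with the latency order
mis_ordered = [-3.0, 0.5, 2.0]    # scores that reverse it

loss_good = pairwise_ranking_loss(well_ordered, latencies)
loss_bad = pairwise_ranking_loss(mis_ordered, latencies)
```

Because the loss depends only on which plan in each pair is faster, rescaling every latency (for example, from seconds to milliseconds) leaves it unchanged, whereas a squared-error regression target would be dominated by the 900-second plan. At inference time a model trained this way would only need `pick_plan` over the scores of the candidate plans.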