Traditional query optimization relies on cost-based optimizers that estimate execution cost (e.g., runtime, memory, and I/O) using predefined heuristics and statistical models. Improving these heuristics requires substantial engineering effort, and even then they often cannot exploit semantic correlations in queries and schemas that would enable better physical plans. Using our DBPlanBench harness for the DataFusion engine, we expose the physical plan through a compact serialized representation and let the LLM propose localized edits that can be applied and executed. We then run an evolutionary search over these edits to refine candidates across iterations. Our key insight is that LLMs can leverage semantic knowledge to identify and apply non-obvious optimizations, such as join orderings that minimize intermediate cardinalities. We obtain up to 4.78$\times$ speedups on some queries, and we demonstrate a small-to-large workflow in which optimizations found on small databases transfer effectively to larger databases.
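The evolutionary loop over LLM-proposed plan edits can be sketched as follows. This is a minimal illustration, not the DBPlanBench implementation: `propose_edits` stands in for the LLM call that receives the serialized physical plan and returns a locally edited candidate, and `evaluate` stands in for executing the edited plan in DataFusion and measuring its runtime; both are simulated here with toy logic.

```python
import random

def propose_edits(plan, rng):
    """Hypothetical stand-in for the LLM: apply one localized edit to the plan.
    In the real system this would be, e.g., reordering a join in the
    serialized physical plan; here we just perturb one slot."""
    i = rng.randrange(len(plan))
    edited = list(plan)
    edited[i] += 1
    return edited

def evaluate(plan):
    """Hypothetical stand-in for executing the plan and measuring runtime.
    Lower is better; the toy optimum is a plan of all 3s."""
    return sum((x - 3) ** 2 for x in plan)

def evolutionary_search(initial_plan, generations=20, pop_size=8, seed=0):
    """Iteratively propose localized edits and keep the lowest-cost plans."""
    rng = random.Random(seed)
    population = [initial_plan]
    for _ in range(generations):
        # Ask the (simulated) LLM for edited variants of current candidates...
        children = [propose_edits(rng.choice(population), rng)
                    for _ in range(pop_size)]
        # ...then select the cheapest plans to carry into the next iteration.
        population = sorted(population + children, key=evaluate)[:pop_size]
    return population[0]

best = evolutionary_search([0, 0, 0])
```

Because parents are retained during selection, the best candidate's cost is monotonically non-increasing across iterations, mirroring how the search only ever replaces a plan with a measured-faster variant.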