Evaluating the Practical Effectiveness of LLM-Driven Index Tuning with Microsoft Database Tuning Advisor

Index tuning is critical for the performance of modern database systems. Industrial index tuners, such as the Database Tuning Advisor (DTA) developed for Microsoft SQL Server, rely on the "what-if" API provided by the query optimizer to estimate the cost of a query under a given index configuration, which can lead to suboptimal recommendations when the estimates are inaccurate. Large language models (LLMs) offer a new approach to index tuning, drawing on knowledge learned from web-scale training datasets. However, the effectiveness of LLM-driven index tuning, especially beyond what commercial index tuners already achieve, remains unclear. In this paper, we study the practical effectiveness of LLM-driven index tuning using both industrial benchmarks and real-world enterprise customer workloads, and compare it with DTA. Our results show that, although DTA is generally more reliable, an LLM can, with only a few invocations, identify configurations that significantly outperform those found by DTA in execution time in a considerable number of cases, highlighting its potential as a complementary technique. We also observe that the LLM's reasoning captures human-intuitive insights that could be distilled to improve DTA. However, adopting LLM-driven index tuning in production remains challenging because of its substantial performance variance, its limited and often negative impact when directly integrated into DTA, and the high cost of performance validation. This work provides motivation, lessons, and practical insights intended to inspire future work on LLM-driven index tuning in both academia and industry.
