Leveraging machine learning to optimize database systems, referred to as Machine Learning for Databases (ML4DB, for short), dates back to the early 1990s, with early work on indexing techniques, selectivity estimation, and query optimization. However, the idea gained mainstream traction only after the introduction of learned indexes in 2018, which triggered a surge of research ranging from learned indexes and cardinality estimators to learned query optimizers, storage layout design, resource management, and database tuning. The current ML4DB landscape is dominated by narrow specialist ML models that are small and trained on limited data. Each specialist model targets a single database learning task on a fixed database engine, hardware platform, query workload, and optimization objective. As a result, these models fall short in real-world settings, where these factors can vary significantly and evolve over time. Covering every combination of these factors leads to an exponential number of ML models, each with limited portability and generalization capability, which limits the utility of existing ML4DB approaches. We address this limitation with Gen-DBA, a single general-purpose foundation model with agentic capabilities for optimizing databases. This paper presents the vision for Gen-DBA, sketches a design for realizing it, and highlights several research challenges that must be addressed to realize it fully.