Learned Query Optimizer in Alibaba MaxCompute: Challenges, Analysis, and Solutions

Existing learned query optimizers remain ill-suited to modern distributed, multi-tenant data warehouses because of idealized modeling assumptions and design choices. Using Alibaba's MaxCompute as a representative system, we surface four fundamental, system-agnostic challenges for any deployable learned query optimizer: 1) highly dynamic execution environments that induce large variance in plan costs; 2) the potential absence of the input statistics needed for cost estimation; 3) the infeasibility of conventional model refinement; and 4) uncertain benefits across different workloads. These challenges expose a deep mismatch between theoretical advances and production realities, and they demand a principled, deployment-first redesign of learned optimizers. To bridge this gap, we present LOAM, a one-stop learned query optimization framework for MaxCompute. Its design principles and techniques generalize and are readily adaptable to similar systems. Architecturally, LOAM introduces a statistics-free plan encoding that leverages operator semantics and historical executions to infer details about data distributions, and it explicitly encodes the execution environments of training queries to learn their impact on plan costs. For online queries, whose environments are unknown at prediction time, LOAM provides a theoretical bound on the achievable performance and a practical strategy for smoothing environmental impacts on cost estimates. For system operation, LOAM integrates domain adaptation techniques into training so that it generalizes effectively to online query plans without requiring conventional refinement. Additionally, LOAM includes a lightweight project selector that prioritizes high-benefit projects for deployment. On production workloads, LOAM achieves up to 30% CPU cost savings over MaxCompute's native query optimizer, which could translate to substantial real-world resource savings.
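To make the idea of smoothing environmental impacts concrete, the following is a minimal illustrative sketch, not LOAM's actual method. It assumes one simple possibility: since the environment of an online query is unknown at prediction time, a cost model's predictions are averaged over a sample of historically observed environments, and the plan with the lowest averaged cost is chosen. The names `smoothed_cost`, `pick_plan`, and `toy_cost_model`, the single-feature environment, and all numbers are hypothetical.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical types: an environment is a tuple of numeric features
# (here only a CPU load in [0, 1]); a plan is identified by a name.
Env = tuple[float, ...]
Plan = str
CostModel = Callable[[Plan, Env], float]


def smoothed_cost(plan: Plan, envs: Sequence[Env], cost_model: CostModel) -> float:
    """Average the predicted cost of `plan` over sampled environments."""
    return mean(cost_model(plan, env) for env in envs)


def pick_plan(candidates: Sequence[Plan], envs: Sequence[Env],
              cost_model: CostModel) -> Plan:
    """Choose the candidate plan with the lowest environment-averaged cost."""
    return min(candidates, key=lambda p: smoothed_cost(p, envs, cost_model))


def toy_cost_model(plan: Plan, env: Env) -> float:
    """Stand-in for a learned model (assumption): cost grows linearly with load."""
    cpu_load, = env
    base = {"hash_join": 10.0, "sort_merge_join": 12.0}[plan]
    sensitivity = {"hash_join": 8.0, "sort_merge_join": 1.0}[plan]
    return base + sensitivity * cpu_load


# Hypothetical environments observed in past executions.
historical_envs = [(0.1,), (0.5,), (0.9,)]
best = pick_plan(["hash_join", "sort_merge_join"], historical_envs, toy_cost_model)
print(best)
```

In this toy setup the plan that is cheapest on an idle cluster is more sensitive to load, so averaging over environments selects the other plan. Averaging is only one possible smoothing rule and is used here purely for illustration.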
