Pre-Execution Query Slot-Time Prediction in Cloud Data Warehouses: A Feature-Scoped Machine Learning Approach

Cloud data warehouses bill compute based on slot-time consumed. In shared multi-tenant environments, query cost is highly variable and hard to estimate before execution, causing budget overruns and degraded scheduling. Static query-planner heuristics fail to capture complex SQL structure, data skew, and workload contention. We present a feature-scoped machine learning approach that predicts BigQuery slot-time before execution using only pre-execution observable signals: a structured query complexity score derived from SQL operator costs, data volume features from planner estimates and workload metadata, and textual features from query text. We deliberately exclude runtime factors (slot-pool utilization, cache state, realized skew) unknowable at submission. The model uses a HistGradientBoostingRegressor trained on log-transformed slot-time, with a TF-IDF + TruncatedSVD-512 text pipeline fused with numeric and categorical features. Trained on 749 queries across seven deployment environments and evaluated out-of-distribution on 746 queries from two held-out environments, the model achieves MAE 1.17 slot-minutes, RMSE 4.71, and 74% explained variance on the full workload. On cost-significant queries (slot-time >= 0.01 min, N=282) the model achieves MAE 3.10 versus 4.95 for a predict-mean baseline and 4.54 for predict-median, a 30-37% reduction. On long-tail queries (>= 20 min, N=22) the model does not outperform trivial baselines, consistent with the hypothesis that long-tail queries are dominated by unobserved runtime factors outside the current feature scope. A complexity-routed dual-model architecture is described as a practical refinement, and directions for closing the long-tail gap are identified as future work.

翻译：云端数据仓库按消耗的时隙时间计费。在共享多租户环境中，查询成本高度可变且难以在执行前准确估算，导致预算超支和调度性能下降。静态查询规划器启发式方法无法捕捉复杂的SQL结构、数据倾斜和工作负载竞争。我们提出一种特征限定型机器学习方法，仅利用执行前可观测信号预测BigQuery时隙时间：基于SQL算子成本的结构化查询复杂度评分、来自规划器估计和工作负载元数据的数据量特征，以及查询文本中的文本特征。我们刻意排除了提交时无法获知的运行时因素（时隙池利用率、缓存状态、实际倾斜）。模型采用对时隙时间取对数变换后训练的HistGradientBoostingRegressor，通过TF-IDF+TruncatedSVD-512文本流水线与数值和类别特征融合。该模型在七个部署环境中的749条查询上训练，并在两个保留环境的746条查询上进行分布外评估，在全量工作负载上实现了1.17时隙分钟的MAE、4.71的RMSE以及74%的解释方差。对于成本显著的查询（时隙时间>=0.01分钟，N=282），模型MAE为3.10，而预测均值基线和预测中位数基线分别为4.95和4.54，降低了30%-37%。对于长尾查询（>=20分钟，N=22），模型未超越平凡基线，这与长尾查询主要由当前特征范围外不可观测的运行时因素主导的假设一致。我们描述了复杂度路由双模型架构作为实用改进，并指出了弥合长尾差距的未来工作方向。