Cardinality estimation is a critical component of, and a longstanding challenge in, modern data warehouses. ByteHouse, ByteDance's cloud-native engine for big data analysis in exabyte-scale environments, serves numerous internal decision-making business scenarios. As demand for ByteHouse grows, cardinality estimation has become the bottleneck for efficient query processing. Specifically, ByteHouse's existing query optimizer uses a traditional Selinger-like cardinality estimator, which can produce huge estimation errors and, in turn, sub-optimal query plans. To improve cardinality estimation accuracy while keeping inference overhead practical, we develop the ByteCard framework, which enables efficient training, updating, and integration of cardinality estimators. Furthermore, ByteCard adapts recent advances in cardinality estimation to build models that balance accuracy and practicality (e.g., inference latency, model size, training/updating overhead). We observe significant query processing speed-ups in ByteHouse after replacing the system's existing cardinality estimates with ByteCard's for several optimization strategies. Evaluations on real-world datasets show that integrating ByteCard improves 99th-percentile latency by up to 30%. Finally, we share our experience in engineering advanced cardinality estimators. We believe this experience can help other data warehouses integrate more accurate and sophisticated solutions on the critical path of query execution.