Learned cardinality estimation methods have achieved higher precision than traditional methods. Among learned methods, query-driven approaches have long suffered from data and workload drift. Although both data-driven and hybrid methods have been proposed to avoid this problem, even the state-of-the-art ones suffer from high training and estimation costs, limited scalability, instability, and a long-tailed distribution problem on high-cardinality and high-dimensional tables, which seriously limits the practical application of learned cardinality estimators. In this paper, we prove that most of these problems are directly caused by the widely used progressive sampling. We solve this problem by introducing predicates into the autoregressive model and propose Duet, a stable, efficient, and scalable hybrid method that estimates cardinality directly, without sampling or any non-differentiable process. Compared to Naru and UAE, Duet not only reduces the inference complexity from $O(n)$ to $O(1)$ but also achieves higher accuracy on high-cardinality and high-dimensional tables. Experimental results show that Duet achieves all of the design goals above, is much more practical, and even has a lower inference cost on CPU than most learned methods have on GPU.
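
To make the cost attributed to progressive sampling concrete, the following is a minimal sketch of that procedure over a toy table. It is not Duet, and it is not the Naru or UAE implementation: the autoregressive model is replaced by a count-based stand-in (`conditional`), and the table, the per-column domains, and the allowed value sets (`ROWS`, `DOMAINS`, `ALLOWED`) are all hypothetical. The sketch also assumes that $n$ in the complexity claim refers to the number of columns. Under that reading, each sample needs one model call per column, because the distribution of column $i$ can only be queried after values for columns $0..i-1$ have been sampled.

```python
import random

# Hypothetical toy table with three correlated columns (illustrative only).
ROWS = [(a, (a + b) % 3, (a * b) % 4) for a in range(4) for b in range(6)]
DOMAINS = [4, 3, 4]                        # number of distinct values per column
ALLOWED = [{1, 2, 3}, {0, 1}, {0, 1, 2}]   # per-column predicate as a set of valid values


def conditional(rows, prefix, domain_size):
    """Count-based stand-in for one autoregressive model call: P(x_i | x_<i)."""
    i = len(prefix)
    counts = [0] * domain_size
    for row in rows:
        if row[:i] == tuple(prefix):
            counts[row[i]] += 1
    total = sum(counts)
    return [c / total for c in counts]


def progressive_sampling(rows, domains, allowed, num_samples, rng):
    """Monte Carlo selectivity estimate; also returns the number of model calls."""
    total, calls = 0.0, 0
    for _ in range(num_samples):
        prefix, weight = [], 1.0
        for i, size in enumerate(domains):
            probs = conditional(rows, prefix, size)
            calls += 1
            # Zero out values that violate the predicate on column i.
            valid = [p if v in allowed[i] else 0.0 for v, p in enumerate(probs)]
            mass = sum(valid)
            if mass == 0.0:
                weight = 0.0   # no qualifying value is reachable from this prefix
                break
            weight *= mass
            # Sample the next value from the renormalized in-range distribution.
            prefix.append(rng.choices(range(size), weights=valid)[0])
        total += weight
    return total / num_samples, calls


def exact_selectivity(rows, allowed):
    """Ground truth: fraction of rows satisfying every per-column predicate."""
    hits = sum(all(v in a for v, a in zip(row, allowed)) for row in rows)
    return hits / len(rows)


if __name__ == "__main__":
    estimate, calls = progressive_sampling(ROWS, DOMAINS, ALLOWED, 1000, random.Random(0))
    print(estimate, exact_selectivity(ROWS, ALLOWED), calls)
```

In this sketch the sequential dependence between columns and the discrete sampling step are both explicit: the call count grows with the number of columns, and `rng.choices` is a non-differentiable operation. Real estimators batch the samples, so the per-sample loop here overstates the wall-clock cost, but the column-by-column dependence remains.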