Cardinality estimation methods based on probability distribution estimation achieve higher accuracy than traditional methods. However, the most advanced of these methods incur high estimation costs on range queries because they rely on sampling. Sampling also makes the estimation process non-differentiable, so the supervision signal from the query workload cannot easily be used to train the model and improve estimation accuracy. In this paper, we propose Duet, a new hybrid and deterministic modeling approach to cardinality estimation that offers better efficiency and scalability than previous approaches. Duet estimates the cardinality of range queries directly, in a differentiable form, with significantly lower time and memory costs. Because the prediction process is differentiable, we can incorporate queries with large estimation errors into training to address the long-tail distribution of estimation errors on high-dimensional tables. We evaluate Duet on classical datasets and benchmarks, and the results demonstrate its effectiveness.
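To make the contrast between sampling-based and deterministic estimation concrete, the following toy sketch estimates the selectivity of a range predicate on a single discrete column in both ways. It is not Duet's model: the column distribution `probs`, the function names, and the single-column setting are all assumptions chosen for illustration. The deterministic version is a plain sum over the probability mass in the range, so it is linear in `probs` and therefore differentiable with respect to them, whereas the sampled version depends on discrete random draws.

```python
import random


def range_selectivity_exact(probs, lo, hi):
    """Deterministic estimate: sum the probability mass of values in [lo, hi].

    `probs[v]` is the (hypothetical) modeled probability of column value v.
    The result is a linear function of `probs`.
    """
    return sum(probs[lo:hi + 1])


def range_selectivity_sampled(probs, lo, hi, n_samples, seed=0):
    """Monte Carlo estimate: draw values from `probs` and count hits in [lo, hi].

    Cost grows with `n_samples`, and the estimate carries sampling variance.
    """
    rng = random.Random(seed)
    draws = rng.choices(range(len(probs)), weights=probs, k=n_samples)
    return sum(lo <= v <= hi for v in draws) / n_samples


def q_error(estimate, truth):
    """Q-error, a common accuracy metric for cardinality estimates.

    Both values are clamped to at least 1 to avoid division by zero.
    """
    estimate, truth = max(estimate, 1.0), max(truth, 1.0)
    return max(estimate, truth) / min(estimate, truth)


if __name__ == "__main__":
    probs = [0.1, 0.2, 0.3, 0.4]   # toy distribution over values 0..3
    n_rows = 1000                  # toy table size
    exact = range_selectivity_exact(probs, 1, 2)
    sampled = range_selectivity_sampled(probs, 1, 2, n_samples=10_000)
    print("exact cardinality:  ", n_rows * exact)
    print("sampled cardinality:", n_rows * sampled)
    print("q-error of sampled vs exact:", q_error(n_rows * sampled, n_rows * exact))
```

In this toy setting the deterministic estimate needs a single pass over the values in the range, while the sampled estimate trades accuracy against the number of draws.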