We study cardinality estimation for LIKE queries on string data, focusing on the most common patterns in real workloads: prefix, suffix, and substring queries. We propose LEARNT, a LIKE query Estimator with Accuracy, Robustness, Negligible overhead, Tunability, and Theoretical guarantees. LEARNT formulates estimation as a bucket-classification problem; when a query is classified into the correct bucket, it yields formal bounds on the Q-error for queries with non-empty answers. It employs a memory-efficient, bucketed, layered-filter architecture built from Bloom filters and compact auxiliary tables, together with optimizations that exploit query skew to reduce storage. For queries with empty answers, LEARNT incorporates dedicated filter-based and prefix-walk strategies that provide probabilistic guarantees of correct identification. Furthermore, to support arbitrarily long query strings, we extend LEARNT with a Markov modeling scheme that composes short-query statistics into estimates for longer queries. A theoretical framework guides parameter selection to minimize storage under accuracy and robustness constraints. Extensive experiments on four real-world datasets show that LEARNT consistently outperforms state-of-the-art methods such as CLIQUE and LPLM, achieving 1.3-1.7x lower mean Q-error, significantly lower tail errors, and up to 70x faster construction with comparable memory usage.
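Two ideas in the abstract can be illustrated with a small sketch: why a correct bucket classification bounds the Q-error, and how short-query statistics can be chained into an estimate for a longer query. The code below is not LEARNT's implementation. It assumes geometric buckets [b^k, b^(k+1)) with the geometric midpoint as the returned estimate, and it uses the textbook n-gram chain rule for the Markov step; the abstract specifies neither choice. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def q_error(estimate, truth):
    """Q-error for a positive estimate and a positive true cardinality."""
    return max(estimate / truth, truth / estimate)


def bucket_index(cardinality, base):
    """Index k of the assumed geometric bucket [base**k, base**(k+1))."""
    k = int(math.log(cardinality, base))
    # Correct for floating-point rounding in the logarithm.
    while base ** k > cardinality:
        k -= 1
    while base ** (k + 1) <= cardinality:
        k += 1
    return k


def bucket_estimate(k, base):
    """Geometric midpoint of bucket k (an assumed representative value)."""
    return base ** (k + 0.5)


def substring_counts(strings, max_len):
    """Number of strings containing each substring of length <= max_len."""
    counts = Counter()
    for s in strings:
        seen = set()  # count each substring once per string
        for i in range(len(s)):
            for j in range(i + 1, min(i + max_len, len(s)) + 1):
                seen.add(s[i:j])
        counts.update(seen)
    return counts


def markov_estimate(query, counts, max_len):
    """Chain-rule estimate for a substring query longer than max_len.

    Start from the count of the first max_len-gram, then multiply by
    count(next gram) / count(overlap shared with the previous gram).
    """
    if len(query) <= max_len:
        return float(counts.get(query, 0))
    est = float(counts.get(query[:max_len], 0))
    for i in range(1, len(query) - max_len + 1):
        gram = query[i:i + max_len]
        denom = counts.get(gram[:-1], 0)
        if est == 0 or denom == 0:
            return 0.0
        est *= counts.get(gram, 0) / denom
    return est


if __name__ == "__main__":
    base = 2.0
    worst = max(
        q_error(bucket_estimate(bucket_index(c, base), base), c)
        for c in range(1, 2001)
    )
    print("worst Q-error:", worst, "bound:", math.sqrt(base))

    counts = substring_counts(["abcd", "abce", "xbcd"], max_len=2)
    print("estimate for %abcd%:", markov_estimate("abcd", counts, 2))
```

Under these assumptions, a correctly chosen bucket gives a Q-error of at most the square root of the base, so a smaller base tightens the bound at the cost of more buckets to distinguish. On the toy data, the chained estimate for `abcd` is 4/3 against a true count of 1, which shows the approximation error that the independence assumption introduces.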