Learning-based Sketches for Frequency Estimation in Data Streams without Ground Truth

Estimating the frequency of items in high-volume, fast data streams has been extensively studied in many areas, such as databases and network measurement. Traditional sketches provide only coarse estimates under strict memory constraints. Although some learning-augmented methods have emerged recently, they typically rely on offline training with real frequencies and/or labels, which are often unavailable. Moreover, these methods suffer from slow update speeds, which limits their suitability for real-time processing, while offering only marginal accuracy improvements. To overcome these challenges, we propose UCL-sketch, a practical learning-based paradigm for per-key frequency estimation. Our design introduces two key innovations: (i) an online training mechanism based on equivalent learning that requires no ground truth (GT), and (ii) a highly scalable architecture that leverages logically structured estimation buckets to scale to real-world data streams. UCL-sketch, which builds on compressive sensing (CS), converges to an estimator that provably yields an error bound far lower than that of prior works, without sacrificing processing speed. Extensive experiments on both real-world and synthetic datasets demonstrate that our approach outperforms previously proposed approaches in terms of per-key accuracy and distribution. Notably, under extremely tight memory budgets, its estimation quality nearly matches that of an (infeasible) omniscient oracle. Moreover, compared to the existing equation-based sketch, UCL-sketch achieves an average decoding speedup of nearly 500 times. To support further research and development, our code is publicly available at https://github.com/Y-debug-sys/UCL-sketch.
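For readers unfamiliar with the "traditional sketches" mentioned above, the following is a minimal Python illustration of one classic example, the Count-Min sketch. It is not UCL-sketch and is not taken from the authors' repository; the class name `CountMinSketch`, the parameters `width` and `depth`, and the use of salted BLAKE2b hashing are assumptions chosen only for illustration.

```python
import hashlib


class CountMinSketch:
    """Classic Count-Min sketch: `depth` rows of `width` counters, one hash per row.

    Illustrative baseline only; names and hashing scheme are assumptions.
    """

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key: str, row: int) -> int:
        # Salt the hash with the row number to get a different hash function per row.
        digest = hashlib.blake2b(
            key.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, key: str, count: int = 1) -> None:
        # Each arriving item increments exactly one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def query(self, key: str) -> int:
        # Hash collisions can only inflate a counter (for non-negative updates),
        # so the minimum over rows is the tightest available estimate.
        return min(self.table[row][self._index(key, row)] for row in range(self.depth))


# Hypothetical toy stream, used only to exercise the sketch.
stream = ["a", "b", "a", "c", "a", "b"]
sketch = CountMinSketch(width=64, depth=4)
for item in stream:
    sketch.update(item)
```

With non-negative updates, `sketch.query(key)` never underestimates the true count, but collisions can make it overestimate when `width` is small relative to the number of distinct keys. This is the kind of coarse estimate under tight memory that the abstract refers to.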
