Learning-based Sketches for Frequency Estimation in Data Streams without Ground Truth

Estimating the frequency of items in high-volume, fast data streams has been extensively studied in many areas, such as databases and network measurement. Traditional sketches provide only coarse estimates under strict memory constraints. Although some learning-augmented methods have emerged recently, they typically rely on offline training with real frequencies and/or labels, which are often unavailable. Moreover, these methods suffer from slow update speeds, which limits their suitability for real-time processing, while offering only marginal accuracy improvements. To overcome these challenges, we propose UCL-sketch, a practical learning-based paradigm for per-key frequency estimation. Our design introduces two key innovations: (i) an online training mechanism based on equivalent learning that requires no ground truth (GT), and (ii) a highly scalable architecture that leverages logically structured estimation buckets to scale to real-world data streams. UCL-sketch, which utilizes compressive sensing (CS), converges to an estimator that provably yields an error bound far lower than that of prior works, without sacrificing processing speed. Extensive experiments on both real-world and synthetic datasets demonstrate that our approach outperforms previously proposed approaches in per-key accuracy and distribution. Notably, under extremely tight memory budgets, its quality almost matches that of an (infeasible) omniscient oracle. Moreover, compared to the existing equation-based sketch, UCL-sketch achieves an average decoding speedup of nearly 500 times. To support further research and development, our code is publicly available at https://github.com/Y-debug-sys/UCL-sketch.
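
For context, the coarse estimates that traditional sketches give under tight memory can be illustrated with a Count-Min sketch, one well-known example of such a structure. The code below is a minimal sketch of that classic baseline, not of UCL-sketch; the class name, the BLAKE2b-based hashing, the table dimensions, and the synthetic skewed stream are all illustrative assumptions.

```python
import hashlib
from collections import Counter


class CountMinSketch:
    """Classic Count-Min sketch: `depth` rows of `width` counters each."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key: str, row: int) -> int:
        # Illustrative hash choice: BLAKE2b salted with the row number,
        # so that each row behaves like an independent hash function.
        digest = hashlib.blake2b(
            key.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, key: str, count: int = 1) -> None:
        # Each arriving item increments one counter per row.
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def query(self, key: str) -> int:
        # Collisions can only inflate a counter, so the minimum over
        # rows is the tightest available (over)estimate.
        return min(
            self.table[row][self._index(key, row)]
            for row in range(self.depth)
        )


# Synthetic skewed stream: low-numbered keys occur far more often.
stream = [f"key{i}" for i in range(200) for _ in range(1 + 200 // (i + 1))]
truth = Counter(stream)

# Far fewer counters per row (64) than distinct keys (200).
sketch = CountMinSketch(width=64, depth=3)
for key in stream:
    sketch.update(key)

# Total overestimation across all keys, caused by hash collisions.
total_error = sum(sketch.query(k) - truth[k] for k in truth)
```

With non-negative updates, a Count-Min estimate never falls below the true count, so `total_error` measures how much hash collisions inflate the per-key estimates when the table is much smaller than the key space.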
