Modern stream processing systems often need to track the frequency of distinct keys in a data stream in real time. Since maintaining exact counts can require a prohibitive amount of memory, many applications rely on compact, probabilistic data structures known as frequency estimation sketches to approximate them. However, mainstream frequency estimation sketches fall short in two critical respects. First, they are memory-inefficient under skewed workloads because they count keys with uniformly sized counters, wasting memory on the leading zeros of many small counts. Second, because they rely on a fixed number of counters, their estimation error deteriorates at least linearly with the length of the stream, which may grow indefinitely. We present Sublime, a framework that generalizes frequency estimation sketches to address these challenges. To reduce memory footprint under skew, Sublime begins with short counters and dynamically elongates them as they overflow, storing their extensions within the same cache line. It employs efficient bit manipulation routines to quickly locate and access a counter's extensions. To maintain accuracy as the stream grows, Sublime also expands its number of counters at a configurable rate, exposing a new spectrum of accuracy-memory tradeoffs that applications can tune to their needs. We apply Sublime to both Count-Min Sketch and Count Sketch. Through theoretical analysis and empirical evaluation, we show that Sublime significantly improves accuracy and memory efficiency over the state of the art while maintaining competitive or superior performance.
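For context, the following is a minimal sketch of a standard Count-Min Sketch, the baseline design that the abstract describes as using a fixed number of uniformly sized counters. It is not Sublime's implementation: it shows neither counter elongation nor counter expansion. The class name, the parameters `width` and `depth`, and the use of salted BLAKE2b as the hash family are assumptions made for illustration only.

```python
import hashlib


class CountMinSketch:
    """Baseline Count-Min Sketch: a fixed depth x width grid of counters.

    Illustrative only. Every counter has the same size and the grid never
    grows, which are the two properties the abstract identifies as limits.
    """

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, key: str, row: int) -> int:
        # Assumed hash family: BLAKE2b salted with the row number, so that
        # each row maps keys to counters independently of the other rows.
        digest = hashlib.blake2b(
            key.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key: str, count: int = 1) -> None:
        # Increment one counter per row.
        for r in range(self.depth):
            self.rows[r][self._index(key, r)] += count

    def estimate(self, key: str) -> int:
        # Colliding keys can only inflate a counter, so the minimum across
        # rows is the tightest available upper bound on the true count.
        return min(self.rows[r][self._index(key, r)] for r in range(self.depth))


# A small skewed stream: one heavy key and many keys that appear once.
sketch = CountMinSketch(width=64, depth=4)
stream = ["hot"] * 1000 + [f"cold-{i}" for i in range(200)]
for item in stream:
    sketch.add(item)
```

In this baseline, every row's counters sum to the stream length, so with `width` fixed the collision noise added to each estimate grows in proportion to the stream. Most of the counters in the example also hold small values yet occupy a full-width integer each, which is the waste under skew that the abstract refers to.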