Modern stream processing systems often need to track the frequency of distinct keys in a data stream in real time. Since maintaining exact counts can require a prohibitive amount of memory, many applications rely on compact, probabilistic data structures known as frequency estimation sketches to approximate them. However, mainstream frequency estimation sketches fall short in two critical aspects. First, they are memory-inefficient under skewed workloads because they use uniformly sized counters to count the keys, thus wasting memory on storing the leading zeros of many small counts. Second, their estimation error deteriorates at least linearly with the length of the stream, which may grow indefinitely, because they rely on a fixed number of counters. We present Sublime, a framework that generalizes frequency estimation sketches to address these challenges. To reduce memory footprint under skew, Sublime begins with short counters and dynamically elongates them as they overflow, storing their extensions within the same cache line. It employs efficient bit manipulation routines to quickly locate and access a counter's extensions. To maintain accuracy as the stream grows, Sublime also expands its number of counters at a configurable rate, exposing a new spectrum of accuracy-memory tradeoffs that applications can tune to their needs. We apply Sublime to both Count-Min Sketch and Count Sketch. Through theoretical analysis and empirical evaluation, we show that Sublime significantly improves accuracy and memory efficiency over the state of the art while maintaining competitive or superior performance.
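For readers unfamiliar with the baseline, the following is a minimal sketch of a conventional Count-Min Sketch, the kind of fixed-size structure the abstract contrasts Sublime against. It is not Sublime itself and does not implement counter elongation or expansion. The class name, the `width` and `depth` parameters, and the use of salted BLAKE2b as the per-row hash are illustrative assumptions, not details taken from the paper.

```python
import hashlib


class CountMinSketch:
    """Textbook Count-Min Sketch with a fixed grid of equally sized counters.

    Illustrative baseline only (assumed names and hashing scheme).
    """

    def __init__(self, width: int, depth: int) -> None:
        self.width = width  # counters per row (fixed for the sketch's lifetime)
        self.depth = depth  # number of rows, one hash function per row
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, key: str, row: int) -> int:
        # Derive an independent-looking hash per row by salting with the row id.
        digest = hashlib.blake2b(
            key.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key: str, count: int = 1) -> None:
        # Increment one counter in every row.
        for r in range(self.depth):
            self.rows[r][self._index(key, r)] += count

    def estimate(self, key: str) -> int:
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(self.rows[r][self._index(key, r)] for r in range(self.depth))


# A small skewed workload: one hot key and many keys seen exactly once.
true_counts = {"hot": 1000}
true_counts.update({f"cold-{i}": 1 for i in range(50)})

sketch = CountMinSketch(width=64, depth=4)
for key, count in true_counts.items():
    sketch.add(key, count)

for key in ("hot", "cold-0", "cold-49"):
    print(key, true_counts[key], sketch.estimate(key))
```

In this baseline every counter has the same width regardless of how small its value is, and the number of counters never changes as the stream grows. These are the two limitations the abstract identifies. The estimate never falls below the true count, but it can exceed it when keys collide.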