Modern stream processing systems must often track the frequency of distinct keys in a data stream in real time. Since maintaining exact counts often entails a prohibitive memory footprint, many applications rely on compact, probabilistic data structures called frequency estimation sketches to approximate them. However, mainstream frequency estimation sketches fall short in two critical respects: (1) They are memory-inefficient under data skew, because they use uniformly sized counters to track the key counts and thus waste memory on storing the leading zeros of many small counter values. (2) Their estimation error deteriorates at least linearly with the stream's length, which may grow indefinitely over time, because they count the keys using a fixed number of counters. We present Sublime, a framework that generalizes frequency estimation sketches to address these problems by dynamically adapting to the stream's skew and length. To save memory under skew, Sublime starts with short counters and, as they overflow, elongates them with extensions stored within the same cache line. It leverages novel bit manipulation routines to quickly access a counter's extension. It also controls the scaling of its error rate by expanding its number of approximate counters as the stream grows. We apply Sublime to Count-Min Sketch and Count Sketch. We show, theoretically and empirically, that Sublime significantly improves accuracy and memory efficiency over the state of the art while maintaining competitive or superior performance.
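For readers unfamiliar with the baseline, the following is a minimal sketch of a standard Count-Min Sketch, one of the two structures the abstract says Sublime is applied to. It is not Sublime itself: it illustrates the fixed number of uniformly sized counters that the abstract identifies as the source of both problems. The class name, the parameters `width` and `depth`, and the choice of salted BLAKE2b as the per-row hash function are assumptions made for illustration only.

```python
import hashlib


class CountMinSketch:
    """Baseline Count-Min Sketch with a fixed grid of counters (illustrative only)."""

    def __init__(self, width: int, depth: int) -> None:
        # depth rows of width counters each; the total never grows with the stream.
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, key: str, row: int) -> int:
        # Assumed hash choice: BLAKE2b salted with the row number, so that
        # each row behaves like an independent hash function.
        digest = hashlib.blake2b(
            key.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key: str, count: int = 1) -> None:
        # Increment one counter per row.
        for row in range(self.depth):
            self.rows[row][self._index(key, row)] += count

    def estimate(self, key: str) -> int:
        # Collisions can only inflate a counter, so the minimum across rows
        # is an upper bound on the true count that is never an underestimate.
        return min(
            self.rows[row][self._index(key, row)] for row in range(self.depth)
        )
```

Because `width` and `depth` are fixed at construction, every additional stream item adds to the collision noise shared by the same counters, and under skew most counters hold small values while still occupying a full-width slot. For example, after `cms = CountMinSketch(width=64, depth=4)` and 100 calls to `cms.add("a")`, `cms.estimate("a")` returns at least 100.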