Elastic Sketch under Random Stationary Streams: Limiting Behavior and Near-Optimal Configuration

\texttt{Elastic-Sketch} is a hash-based data structure for counting item occurrences in a data stream, and it has been empirically shown to achieve a better memory-accuracy trade-off than classical methods. The algorithm combines a \textit{heavy block}, which aims to maintain exact counts for a small set of dynamically \textit{elected} items, with a \textit{light block} that implements a \texttt{Count-Min Sketch} (\texttt{CM}) to summarize the remaining traffic. The heavy block dynamics are governed by a hash function~$\beta$ that maps items into~$m_1$ buckets and by an \textit{eviction threshold}~$\lambda$, which controls how easily an elected item can be replaced. We show that the performance of \texttt{Elastic-Sketch} depends strongly on the stream characteristics and the choice of~$\lambda$. Since optimal parameter choices depend on unknown stream properties, we analyze \texttt{Elastic-Sketch} under a \textit{stationary random stream} model, a common assumption that captures the statistical regularities observed in real workloads. Specifically, as the stream length goes to infinity, we derive closed-form expressions for the limiting distribution of the counters and the resulting expected counting error. These expressions are efficiently computable, enabling practical grid-based tuning of the memory split between the heavy and \texttt{CM} blocks (via~$m_1$) and of the eviction threshold~$\lambda$. We further characterize the structure of the optimal eviction threshold, substantially reducing the search space and showing how this threshold depends on the arrival distribution. Extensive numerical simulations validate our asymptotic results on finite streams drawn from a Zipf distribution.
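
To make the heavy/light mechanism concrete, the following is a minimal Python sketch of an Elastic-Sketch-style counter. It is an illustration under stated assumptions, not the implementation analyzed here: the abstract does not spell out the election rule, so the code assumes the commonly described scheme in which each heavy bucket keeps a positive vote count for its elected item and a negative vote count for colliding items, and evicts the elected item once the negative votes reach `lam` times the positive votes. The names `ElasticSketch`, `m2` (CM row width), `d` (number of CM rows), the `flag` field, and the reset values after an eviction are all hypothetical choices made for this example.

```python
import hashlib


def _hash(item: str, seed: int, buckets: int) -> int:
    """Deterministic seeded hash of `item` into range(buckets)."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little") % buckets


class ElasticSketch:
    """Illustrative heavy block (m1 buckets) plus Count-Min light block (d x m2)."""

    def __init__(self, m1: int, m2: int, d: int, lam: float):
        self.m1, self.m2, self.d, self.lam = m1, m2, d, lam
        # Each heavy bucket is [key, vote_plus, vote_minus, flag].
        # flag=True means part of the key's count may live in the light block.
        self.heavy = [[None, 0, 0, False] for _ in range(m1)]
        self.light = [[0] * m2 for _ in range(d)]

    def _cm_add(self, item: str, count: int) -> None:
        for row in range(self.d):
            self.light[row][_hash(item, row + 1, self.m2)] += count

    def _cm_query(self, item: str) -> int:
        return min(self.light[row][_hash(item, row + 1, self.m2)]
                   for row in range(self.d))

    def update(self, item: str) -> None:
        bucket = self.heavy[_hash(item, 0, self.m1)]
        if bucket[0] is None:            # empty bucket: elect the item
            bucket[:] = [item, 1, 0, False]
        elif bucket[0] == item:          # elected item: count it exactly
            bucket[1] += 1
        else:                            # collision: cast a negative vote
            bucket[2] += 1
            if bucket[2] >= self.lam * bucket[1]:
                # Assumed eviction rule: move the old count to the light
                # block and elect the newcomer.
                self._cm_add(bucket[0], bucket[1])
                bucket[:] = [item, 1, 1, True]
            else:
                self._cm_add(item, 1)

    def query(self, item: str) -> int:
        key, votes, _, flag = self.heavy[_hash(item, 0, self.m1)]
        if key == item:
            return votes + (self._cm_query(item) if flag else 0)
        return self._cm_query(item)
```

In this sketch, a small `lam` makes elected items easy to replace, while a large `lam` keeps early arrivals in place for longer; this is the trade-off that the eviction threshold controls. An item that has stayed elected since its first arrival is counted exactly, whereas every other estimate inherits the one-sided overestimation of the Count-Min block.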
