Cuckoo Heavy Keeper and the balancing act of maintaining heavy hitters in stream processing

Finding heavy hitters in databases and data streams is a fundamental problem, with applications ranging from network monitoring to database query optimization and machine learning. Approximation algorithms offer practical solutions, but they involve trade-offs among throughput, memory usage, and accuracy. Modern applications complicate these trade-offs further: beyond sequential processing, they demand parallel scaling and support for concurrent queries and updates. Analysis of these trade-offs led us to the key idea behind our proposed streaming algorithm, Cuckoo Heavy Keeper (CHK). The approach introduces an inverted process for distinguishing frequent from infrequent items, which unlocks algorithmic synergies that were inaccessible to conventional approaches. By further analyzing the competing metrics with a focus on parallelism, we propose an algorithmic framework that balances scalability aspects and provides options to optimize query and insertion efficiency based on their relative frequencies. The framework can parallelize any heavy-hitter detection algorithm. In addition to analyzing the algorithms, we present an extensive evaluation on both real-world and synthetic data across diverse distributions and query selectivities, representing a broad spectrum of application needs. Compared to state-of-the-art methods, CHK improves throughput by 1.7-5.7$\times$ and accuracy by up to four orders of magnitude, even under low-skew data and tight memory constraints. These properties allow its parallel instances to achieve near-linear scale-up and low latency for heavy-hitter queries, even under a high query rate. We expect the versatility of CHK and its parallel instances to impact a broad spectrum of tools and applications in large-scale data analytics and stream processing systems.
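
To make the heavy-hitter problem itself concrete, the sketch below uses the classic Misra-Gries summary. This is not CHK and not the method proposed here; it is a well-known baseline, shown only to illustrate the throughput, memory, and accuracy trade-off that approximation algorithms face. The function names, the parameter `k` (which bounds the summary to `k - 1` counters), and the threshold `phi` are assumptions chosen for this illustration. The second, exact pass in `heavy_hitters` is also an illustrative simplification; a true streaming setting would not be able to replay the input.

```python
from collections import Counter


def misra_gries(stream, k):
    """Classic Misra-Gries summary using at most k - 1 counters.

    Any item that occurs more than n / k times in a stream of length n
    is guaranteed to remain in the summary. Stored counts underestimate
    the true counts by at most n / k.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Summary is full: decrement every counter and evict zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def heavy_hitters(stream, phi, k):
    """Return items whose exact frequency exceeds phi * n.

    Candidates come from the Misra-Gries summary; a second pass over the
    (replayable) input computes exact counts for the candidates only.
    Choose k >= 1 / phi so that no true heavy hitter is missed.
    """
    items = list(stream)
    candidates = misra_gries(items, k)
    exact = Counter(x for x in items if x in candidates)
    threshold = phi * len(items)
    return {x: c for x, c in exact.items() if c > threshold}
```

For example, on the 12-item stream `list("aaaaaabbbcde")` with `k = 3`, the summary keeps only `"a"`, with a stored count below its true count of 6. A smaller `k` saves memory but widens that undercount, which is the kind of trade-off the abstract refers to.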
