Finding heavy hitters in databases and data streams is a fundamental problem, with applications ranging from network monitoring to database query optimization, machine learning, and more. Approximation algorithms offer practical solutions, but they entail trade-offs among throughput, memory usage, and accuracy. Moreover, modern applications further complicate these trade-offs by demanding capabilities beyond sequential processing, requiring both parallel performance scaling and support for concurrent queries and updates. Analysis of these trade-offs led us to the key idea behind our proposed streaming algorithm, Cuckoo Heavy Keeper (CHK). The approach introduces an inverted process for distinguishing frequent from infrequent items, which unlocks algorithmic synergies that were previously inaccessible with conventional approaches. By further analyzing the competing metrics with a focus on parallelism, we propose an algorithmic framework that balances scalability aspects and provides options to optimize query and insertion efficiency based on their relative frequencies. The framework is suitable for parallelizing any sequential heavy-hitter algorithm. Besides the analysis of the algorithms, we present an extensive evaluation on both real-world and synthetic data across diverse distributions and query selectivities, representing the broad spectrum of application needs. Compared to state-of-the-art methods, CHK improves throughput by 1.7-5.7x and accuracy by up to four orders of magnitude, even under low-skew data and tight memory constraints. These properties allow its parallel instances to achieve near-linear scale-up and low latency for heavy-hitter queries, even under high query rates. We expect the versatility of CHK and its parallel instances to benefit a broad spectrum of tools and applications in large-scale data analytics and stream processing systems.