Approximate Membership Query (AMQ) structures are essential for high-throughput systems in databases, networking, and bioinformatics. Bloom filters are fast but do not support deletions. Existing GPU-based dynamic alternatives, such as the Two-Choice Filter (TCF) and GPU Quotient Filter (GQF), support deletions but incur severe performance penalties. We present Cuckoo-GPU, an open-source, high-performance Cuckoo filter library for GPUs. Instead of prioritizing cache locality, Cuckoo-GPU embraces the inherently random access pattern of Cuckoo hashing to fully saturate global memory bandwidth. Our design features a lock-free architecture built on atomic compare-and-swap operations, paired with a novel breadth-first search-based eviction heuristic that minimizes thread divergence and bounds sequential memory accesses during high-load insertions. Evaluated on NVIDIA GH200 (HBM3) and RTX PRO 6000 Blackwell (GDDR7) systems, Cuckoo-GPU closes the performance gap between append-only and dynamic AMQ structures. On the same hardware, it achieves insertion, query, and deletion throughputs up to 378x, 6x, and 258x higher than GQF, and up to 4.1x, 34.7x, and 107x higher than TCF, respectively. It also delivers up to a 350x speedup over the fastest available multi-threaded CPU-based Cuckoo filter implementation. Moreover, its query throughput rivals that of the append-only GPU-based Blocked Bloom filter, demonstrating that dynamic AMQ structures can be deployed on modern accelerators without sacrificing performance.
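To make the lock-free idea concrete, the following is a minimal host-side C++ sketch of a Cuckoo filter whose slots are claimed and released with atomic compare-and-swap. It is an illustration under stated assumptions, not the Cuckoo-GPU implementation: the class name `CuckooFilterSketch`, the 16-bit fingerprints, the four slots per bucket, the hash mixer, and the kick limit are all hypothetical choices, and the eviction loop is a plain sequential kick-out walk rather than the breadth-first search-based heuristic described above.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch; none of these names come from the Cuckoo-GPU library.
class CuckooFilterSketch {
public:
    static constexpr std::size_t kSlotsPerBucket = 4;  // assumed bucket width
    static constexpr std::size_t kMaxKicks = 64;       // assumed eviction bound

    explicit CuckooFilterSketch(std::size_t num_buckets)
        : mask_(num_buckets - 1), slots_(num_buckets * kSlotsPerBucket) {
        // A power-of-two bucket count keeps the XOR-based alternate index in range.
        assert(num_buckets != 0 && (num_buckets & mask_) == 0);
        for (auto& s : slots_) s.store(0, std::memory_order_relaxed);
    }

    bool insert(std::uint64_t key) {
        const std::uint64_t h = mix(key);
        std::uint16_t fp = fingerprint(h);
        const std::size_t i1 = h & mask_;
        const std::size_t i2 = alt(i1, fp);
        if (try_place(i1, fp) || try_place(i2, fp)) return true;

        // Both buckets are full: evict a resident fingerprint and relocate it.
        std::size_t i = i2;
        for (std::size_t kick = 0; kick < kMaxKicks; ++kick) {
            auto& slot = slots_[i * kSlotsPerBucket + kick % kSlotsPerBucket];
            fp = slot.exchange(fp);       // atomically swap ours in, take the victim
            if (fp == 0) return true;     // slot had been freed concurrently
            i = alt(i, fp);               // the victim's other candidate bucket
            if (try_place(i, fp)) return true;
        }
        // The fingerprint still in hand is dropped here; a real filter would
        // have to stash it or undo the walk to avoid false negatives.
        return false;
    }

    bool contains(std::uint64_t key) const {
        const std::uint64_t h = mix(key);
        const std::uint16_t fp = fingerprint(h);
        const std::size_t i1 = h & mask_;
        return bucket_has(i1, fp) || bucket_has(alt(i1, fp), fp);
    }

    bool remove(std::uint64_t key) {
        const std::uint64_t h = mix(key);
        const std::uint16_t fp = fingerprint(h);
        const std::size_t i1 = h & mask_;
        return try_swap(i1, fp, 0) || try_swap(alt(i1, fp), fp, 0);
    }

private:
    // 64-bit mixing function (constants from the splitmix64 finalizer).
    static std::uint64_t mix(std::uint64_t x) {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return x;
    }
    // Zero marks an empty slot, so fingerprints are forced to be nonzero.
    static std::uint16_t fingerprint(std::uint64_t h) {
        const std::uint16_t fp = static_cast<std::uint16_t>(h >> 48);
        return fp == 0 ? std::uint16_t{1} : fp;
    }
    // Partial-key cuckoo hashing: alt(alt(i, fp), fp) == i.
    std::size_t alt(std::size_t i, std::uint16_t fp) const {
        return (i ^ mix(fp)) & mask_;
    }
    // CAS one slot of the bucket from `from` to `to`; no locks are taken.
    bool try_swap(std::size_t bucket, std::uint16_t from, std::uint16_t to) {
        for (std::size_t s = 0; s < kSlotsPerBucket; ++s) {
            std::uint16_t expected = from;
            if (slots_[bucket * kSlotsPerBucket + s]
                    .compare_exchange_strong(expected, to)) return true;
        }
        return false;
    }
    bool try_place(std::size_t bucket, std::uint16_t fp) {
        return try_swap(bucket, 0, fp);
    }
    bool bucket_has(std::size_t bucket, std::uint16_t fp) const {
        for (std::size_t s = 0; s < kSlotsPerBucket; ++s)
            if (slots_[bucket * kSlotsPerBucket + s].load() == fp) return true;
        return false;
    }

    std::size_t mask_;
    std::vector<std::atomic<std::uint16_t>> slots_;
};
```

Constructing `CuckooFilterSketch f(1024)` and calling `f.insert(k)`, `f.contains(k)`, and `f.remove(k)` exercises the three operations the abstract benchmarks. Deletion works because a fingerprint is stored verbatim and can be cleared with a single compare-and-swap, which a Bloom filter's shared bits do not allow. In this sketch a fingerprint being relocated is briefly absent from the table, so a concurrent query could miss it; how Cuckoo-GPU handles that window is not stated in the abstract.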