Hive Hash Table: A Warp-Cooperative, Dynamically Resizable Hash Table for GPUs

Hash tables are essential building blocks in data-intensive applications, yet existing GPU implementations often struggle with concurrent updates, high load factors, and irregular memory access patterns. We present Hive hash table, a high-performance, warp-cooperative, dynamically resizable GPU hash table that adapts to varying workloads without global rehashing. Hive hash table makes three key contributions. First, a cache-aligned packed bucket layout stores each key-value pair as a single 64-bit word, enabling coalesced memory access and atomic updates with a single CAS operation. Second, two warp-synchronous concurrency protocols, Warp-Aggregated-Bitmask-Claim (WABC) and Warp-Cooperative Match-and-Elect (WCME), reduce contention to one atomic operation per warp while ensuring lock-free progress. Third, a load-factor-aware dynamic resizing strategy expands or contracts capacity in warp-parallel K-bucket batches using linear hashing, maintaining balanced occupancy. To handle insertions under heavy contention, Hive hash table employs a four-step strategy: replace, claim-and-commit, bounded cuckoo eviction, and overflow-stash fallback. This design provides lock-free fast paths and a recovery cost under contention that is bounded by a fixed eviction depth, while eliminating ABA hazards during concurrent updates. Experimental evaluation on an NVIDIA RTX 4090 shows that Hive hash table sustains load factors up to 95% while delivering 1.5-2x higher throughput than state-of-the-art GPU hash tables (Slab-Hash, DyCuckoo, WarpCore) under mixed insert-delete-lookup workloads. On a balanced workload, Hive hash table reaches 3.5 billion updates/s and nearly 4 billion lookups/s, demonstrating scalability and efficiency for GPU-accelerated data processing.
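
The packed-word layout, the first two insertion steps (replace, then claim-and-commit), and linear-hashing addressing can be illustrated with a minimal host-side C++ sketch. This is not the authors' implementation. It assumes 32-bit keys and 32-bit values, a reserved empty-key sentinel, and 16 slots per 128-byte bucket, none of which the abstract specifies, and all names (`pack`, `Bucket`, `insert_or_replace`, `linear_hash_bucket`) are hypothetical. It uses `std::atomic` in place of GPU atomics and a sequential slot scan in place of warp-cooperative probing, so WABC and WCME are not shown.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Assumption: key in the high 32 bits, value in the low 32 bits of one word.
constexpr std::uint32_t kEmptyKey = 0xFFFFFFFFu;  // hypothetical reserved sentinel
constexpr int kSlotsPerBucket = 16;               // 16 x 8 B = 128 B (assumed size)

constexpr std::uint64_t pack(std::uint32_t key, std::uint32_t value) {
    return (static_cast<std::uint64_t>(key) << 32) | value;
}
constexpr std::uint32_t key_of(std::uint64_t w) { return static_cast<std::uint32_t>(w >> 32); }
constexpr std::uint32_t value_of(std::uint64_t w) { return static_cast<std::uint32_t>(w); }
constexpr std::uint64_t kEmptySlot = pack(kEmptyKey, 0);

struct alignas(128) Bucket {
    std::atomic<std::uint64_t> slots[kSlotsPerBucket];
    Bucket() {
        for (auto& s : slots) s.store(kEmptySlot, std::memory_order_relaxed);
    }
};

// Step 1 (replace) and step 2 (claim-and-commit) on a single bucket.
// Returns false when the bucket is full; a complete table would then fall
// back to eviction or a stash, which this sketch does not model.
inline bool insert_or_replace(Bucket& b, std::uint32_t key, std::uint32_t value) {
    assert(key != kEmptyKey);
    const std::uint64_t desired = pack(key, value);
    // Replace: if the key is already present, swap the whole word in one CAS.
    for (auto& slot : b.slots) {
        std::uint64_t seen = slot.load(std::memory_order_acquire);
        while (key_of(seen) == key) {
            if (slot.compare_exchange_weak(seen, desired, std::memory_order_acq_rel,
                                           std::memory_order_acquire)) {
                return true;
            }
        }
    }
    // Claim-and-commit: a single CAS turns an empty slot into the new pair.
    for (auto& slot : b.slots) {
        std::uint64_t expected = kEmptySlot;
        if (slot.compare_exchange_strong(expected, desired, std::memory_order_acq_rel,
                                         std::memory_order_acquire)) {
            return true;
        }
        // The CAS failed; if a racing writer stored the same key, overwrite it.
        while (key_of(expected) == key) {
            if (slot.compare_exchange_weak(expected, desired, std::memory_order_acq_rel,
                                           std::memory_order_acquire)) {
                return true;
            }
        }
    }
    return false;
}

inline bool lookup(const Bucket& b, std::uint32_t key, std::uint32_t& out) {
    if (key == kEmptyKey) return false;
    for (const auto& slot : b.slots) {
        const std::uint64_t w = slot.load(std::memory_order_acquire);
        if (key_of(w) == key) {
            out = value_of(w);
            return true;
        }
    }
    return false;
}

// Textbook linear-hashing address: a round starts with (base << level)
// buckets, and buckets below `split` have already been split this round,
// so their keys are addressed with the next, twice-as-large modulus.
inline std::uint64_t linear_hash_bucket(std::uint64_t h, std::uint64_t base,
                                        unsigned level, std::uint64_t split) {
    std::uint64_t idx = h % (base << level);
    if (idx < split) idx = h % (base << (level + 1));
    return idx;
}
```

Because a key and its value share one word, a reader never observes a key paired with a stale value, which is why a single CAS is enough for both the replace and the claim path. In a CUDA port, the `compare_exchange` calls would presumably map to `atomicCAS` on `unsigned long long`, with the lanes of a warp each inspecting one slot and a single elected lane issuing the CAS. The sketch also shows why linear hashing avoids global rehashing: advancing `split` by one only moves the keys of the bucket being split.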
