As server CPUs scale to dozens and now hundreds of cores per socket, parallel query engines must rethink how they redistribute data between threads. Partitioned operators such as hash joins and aggregations require frequent data redistribution across threads, yet existing intra-process shuffle designs fundamentally fail to scale with core count: batch partitioning avoids cross-thread synchronization in the hot path but materializes all intermediate data, introduces a global producer/consumer barrier, and requires a consumption approach with low cache locality, while channel-based streaming avoids materialization but incurs per-channel synchronization that scales poorly with core count. As core counts rise, these architectural tradeoffs increasingly prevent engines from fully utilizing modern hardware. We present a ring-buffer streaming shuffle design that addresses these shortcomings through lock-free atomic slot acquisition into fixed-size batch groups, achieving amortized O(1) synchronization cost per batch and O(M) memory independent of input size. Ring-buffer shuffle has been implemented in Redpanda's Oxla query engine for two years, where it currently powers production queries for Redpanda SQL users. We evaluate all three approaches on a 72-core NVIDIA GraceHopper, a 192-core dual-socket AWS Graviton4, and a 96-core (192-thread) AMD EPYC. On a 72-core single-socket system the ring buffer outperforms channel streaming by up to 44% and batch partitioning by up to 79%; at 192 cores the advantage over channel grows to over 100% and over 300% versus batch partitioning. Even so, on chiplet architectures with many partitioned L3 caches, the shared atomic counter becomes a cross-die bottleneck and channel-based streaming remains competitive. End-to-end Graviton4 evaluation on TPC-H (21 queries) and ClickBench (43 queries) shows the advantage is workload-shape-dependent.
翻译:随着服务器CPU每插槽核心数扩展到数十乃至数百个,并行查询引擎必须重新思考线程间数据重分布方式。哈希连接与聚合等分区操作算子需要频繁进行跨线程数据重分布,然而现有进程内shuffle设计从根本上无法随核心数扩展:批量分区方案在热路径中避免了跨线程同步,但需物化所有中间数据,引入全局生产者/消费者屏障,并采用缓存局部性较低的消费方式;而基于通道的流式方案虽无需物化,却存在随核心数扩展性欠佳的每通道同步开销。随着核心数增长,这些架构权衡日益阻碍引擎充分利用现代硬件。本文提出一种环形缓冲区流式shuffle设计,通过无锁原子槽位获取固定大小批次组,实现每批次摊销O(1)同步代价及与输入规模无关的O(M)内存开销。环形缓冲区shuffle已在Redpanda的Oxla查询引擎中部署两年,目前支撑着Redpanda SQL用户的生产查询。我们在72核NVIDIA GraceHopper、192核双插槽AWS Graviton4及96核(192线程)AMD EPYC上评估了三种方案。在72核单插槽系统中,环形缓冲区性能较通道流式提升最高44%,较批量分区提升最高79%;在192核时,较通道流式优势超过100%,较批量分区优势超过300%。然而在具有多分区L3缓存的chiplet架构上,共享原子计数器成为跨芯片瓶颈,通道流式仍具竞争力。基于TPC-H(21个查询)与ClickBench(43个查询)的全流程Graviton4评估表明,其优势取决于工作负载形态。