BlobShuffle: Cost-Effective Repartitioning in Stream Processing Systems via Object Storage Exemplified with Kafka Streams

Shuffling or repartitioning data streams is an essential operation of state-of-the-art stream processing frameworks to support stateful workloads in a large-scale, distributed setting. In today's cloud deployments, however, shuffling can become a major cost driver due to substantial network traffic across multiple availability zones (AZs) as well as an operational burden when operating a high-throughput, strongly consistent messaging backbone at scale. We present BlobShuffle, a novel approach to cost-effective shuffling for stream processing systems that leverages cloud object storage as an intermediate exchange layer. Instead of sending all shuffled records directly, BlobShuffle groups records into batches, stores these batches in cloud object storage, and forwards only compact notifications. Downstream operators use these notifications to retrieve the relevant batches and extract the corresponding records. BlobShuffle balances cost efficiency and latency through configurable batching and a distributed caching mechanism. BlobShuffle is implemented as an add-on for Kafka Streams that requires only minimal code changes to existing applications, leaves Kafka and the underlying infrastructure unmodified, and preserves Kafka Streams' consistency and correctness guarantees. In a large-scale experimental evaluation on a Kubernetes-based AWS deployment, we show that BlobShuffle can reduce shuffling costs by more than 40x compared to native Kafka Streams shuffling while keeping the 95th percentile shuffle latency below 2 seconds. Moreover, it scales to processing more than 2 GiB/s without encountering a scalability limit in our experiments, indicating that BlobShuffle can economically support shuffle-intensive workloads at large scale.

翻译：数据流的分区重排或重分区是现代流处理框架在分布式大规模环境下支持有状态工作负载的核心操作。然而，在当前云部署场景中，重分区操作因跨多个可用区的网络流量而产生显著成本，同时运维高吞吐、强一致性的消息中间件也会带来沉重的运维负担。本文提出BlobShuffle，一种利用云对象存储作为中间交换层的流处理系统经济高效重分区方法。BlobShuffle并非直接发送所有重分区记录，而是将记录分批存储至云对象存储，仅转发轻量级通知。下游算子通过通知检索对应批次并提取所需记录。通过可配置的批处理机制与分布式缓存策略，BlobShuffle实现了成本效率与延迟之间的平衡。BlobShuffle作为Kafka Streams的扩展组件实现，对现有应用仅需极少的代码修改，不改变Kafka及其底层基础设施，同时完整保留了Kafka Streams的强一致性与正确性保证。在基于Kubernetes的AWS部署环境下进行的大规模实验评估表明，与原生Kafka Streams重分区相比，BlobShuffle可将重分区成本降低40倍以上，同时保持95%分位重分区延迟低于2秒。此外，该方案在实验中可处理超过2 GiB/s的数据流而未遇到可扩展性瓶颈，证明BlobShuffle能够以经济高效的方式支持大规模重分区密集型工作负载。