Shuffle is one of the most expensive communication primitives in distributed data processing and is difficult to scale. Prior work addresses the scalability challenges of shuffle by building monolithic shuffle systems. These systems are costly to develop, and they are tightly integrated with batch processing frameworks that offer only high-level APIs such as SQL. New applications, such as ML training, require more flexibility and finer-grained interoperability with shuffle. They are often unable to leverage existing shuffle optimizations. We propose an extensible shuffle architecture. We present Exoshuffle, a library for distributed shuffle that offers competitive performance and scalability as well as greater flexibility than monolithic shuffle systems. We design an architecture that decouples the shuffle control plane from the data plane without sacrificing performance. We build Exoshuffle on Ray, a distributed futures system for data and ML applications, and demonstrate that we can: (1) rewrite previous shuffle optimizations as application-level libraries with an order of magnitude less code, (2) achieve shuffle performance and scalability competitive with monolithic shuffle systems, and break the CloudSort record as the world's most cost-efficient sorting system, and (3) enable new applications such as ML training to easily leverage scalable shuffle.
翻译:Shuffle是分布式数据处理中最昂贵的通信原语之一,且难以扩展。现有工作通过构建单体式Shuffle系统应对可扩展性挑战,此类系统开发成本高昂,且与仅提供SQL等高层API的批处理框架紧密耦合。机器学习训练等新型应用需要更高的灵活性及与Shuffle的细粒度互操作性,往往无法利用现有Shuffle优化技术。我们提出一种可扩展的Shuffle架构:Exoshuffle——一种分布式Shuffle库,在提供与单体式Shuffle系统相当的性能与可扩展性的同时,具备更强的灵活性。我们设计的架构在不牺牲性能的前提下解耦了Shuffle控制平面与数据平面。基于面向数据与机器学习应用的分布式未来系统Ray构建Exoshuffle,我们证明能够:(1)以数量级更少的代码将现有Shuffle优化重构为应用级库;(2)达到与单体式Shuffle系统相匹敌的Shuffle性能与可扩展性,并打破CloudSort记录成为全球最具成本效益的排序系统;(3)使机器学习训练等新型应用轻松利用可扩展的Shuffle。