Shuffle exchanges intermediate results between upstream and downstream operators in distributed data processing and is usually the bottleneck due to factors such as small random I/Os and network contention. Several systems have been designed to improve shuffle efficiency, but from our experiences of running ultra-large clusters at Alibaba Cloud MaxCompute platform, we observe that they can not adapt to highly dynamic job characteristics and cluster resource conditions, and their fault tolerance mechanisms are passive and inefficient when failures are inevitable. To tackle their limitations, we design and implement FuxiShuffle as a general data shuffle service for the ultra-large production environment of MaxCompute, featuring good adaptability and efficient failure resilience. Specifically, to achieve good adaptability, FuxiShuffle dynamically selects the shuffle mode based on runtime information, conducts progress-aware scheduling for the downstream workers, and automatically determines the most suitable backup strategy for each shuffle data chunk. To make failure resilience efficient, FuxiShuffle actively ensures data availability with multi-replica failover, prevents memory overflow with careful memory management, and employs an incremental recovery mechanism that does not lose computation progress. Our experiments show that, compared to baseline systems, FuxiShuffle significantly reduces not only end-to-end job completion time but also aggregate resource consumption. Micro experiments suggest that our designs are effective in improving adaptability and failure resilience.
翻译:在分布式数据处理中,Shuffle负责在上游与下游算子间交换中间结果,通常因小规模随机I/O及网络争用等因素成为系统瓶颈。已有若干系统旨在提升Shuffle效率,但根据我们在阿里云MaxCompute平台超大规模集群的运维经验,发现这些系统难以适应高度动态的作业特性与集群资源状况,且其容错机制在故障不可避免时呈现被动低效的特点。为突破现有局限,我们设计并实现了FuxiShuffle作为MaxCompute超大规模生产环境中的通用数据Shuffle服务,其核心特性在于良好的自适应能力与高效的故障弹性。具体而言,为实现良好的自适应性,FuxiShuffle依据运行时信息动态选择Shuffle模式,对下游工作节点实施进度感知调度,并为每个Shuffle数据块自动确定最适宜的备份策略。为实现高效的故障弹性,FuxiShuffle通过多副本故障转移主动保障数据可用性,通过精细的内存管理防止内存溢出,并采用不丢失计算进度的增量恢复机制。实验表明,与基线系统相比,FuxiShuffle不仅显著降低了端到端作业完成时间,同时减少了总体资源消耗。微观实验证实,我们的设计在提升自适应性与故障弹性方面具有显著效果。