We introduce Diffuse, a system that dynamically performs task and kernel fusion in distributed, task-based runtime systems. The key component of Diffuse is an intermediate representation of distributed computation that makes the analyses required to fuse distributed tasks scalable. We pair task fusion with a JIT compiler that fuses the kernels within each fused task. We show empirically that Diffuse's intermediate representation is general enough to be a target for two real-world, task-based libraries (cuNumeric and Legate Sparse), letting Diffuse find optimization opportunities across function and library boundaries. On up to 128 GPUs, Diffuse accelerates unmodified applications developed by composing task-based libraries by a geometric mean of 1.86x, with speedups ranging from 0.93x to 10.7x. Diffuse also finds optimization opportunities missed by the original application developers, enabling high-level Python programs to match or exceed the performance of an explicitly parallel MPI library.
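To make the idea of task fusion concrete, the following is a minimal, single-process sketch of what fusing a chain of elementwise tasks buys in principle: fewer task launches and no materialized intermediates. It is only an illustration under stated assumptions and does not represent Diffuse's intermediate representation, its fusion analysis, or its JIT compiler. The names `Task`, `run_tasks`, and `fuse` are hypothetical and appear nowhere in the original work.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Task:
    """Hypothetical elementwise task: applies `fn` to every element of its input."""
    name: str
    fn: Callable[[float], float]


def run_tasks(tasks: List[Task], data: List[float]) -> Tuple[List[float], int]:
    """Run tasks in order; each launch materializes a full intermediate list."""
    launches = 0
    for task in tasks:
        data = [task.fn(x) for x in data]
        launches += 1
    return data, launches


def fuse(tasks: List[Task]) -> Task:
    """Toy fusion: compose a chain of elementwise tasks into a single task."""
    def fused(x: float) -> float:
        # Apply every original task body to one element before moving on,
        # so no intermediate list is ever built.
        for task in tasks:
            x = task.fn(x)
        return x
    return Task("+".join(task.name for task in tasks), fused)


tasks = [
    Task("scale", lambda x: 2.0 * x),
    Task("shift", lambda x: x + 1.0),
    Task("square", lambda x: x * x),
]
data = [0.0, 1.0, 2.0]

unfused_out, unfused_launches = run_tasks(tasks, data)
fused_task = fuse(tasks)
fused_out, fused_launches = run_tasks([fused_task], data)

print(unfused_out, unfused_launches)
print(fused_task.name, fused_out, fused_launches)
```

Both paths compute the same values, but the fused path issues one launch instead of three. In a real distributed setting, fusion additionally requires proving that the tasks' data partitions and dependencies are compatible, which this sketch deliberately sidesteps by assuming purely elementwise tasks.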