PolyTOPS: Reconfigurable and Flexible Polyhedral Scheduler

Gianpietro Consolaro, Zhen Zhang, Harenome Razanajato, Nelson Lossing, Nassim Tchoulak, Adilla Susungi, Artur Cesar Araujo Alves, Renwei Zhang, Denis Barthou, Corinne Ancourt, Cedric Bastoul

From arXiv; 14 pages, bibliography included. The paper has been accepted to CGO 2024; publication in the proceedings is pending. This is a preprint version.

Polyhedral techniques have been widely used for automatic code optimization in low-level compilers and higher-level processes. Loop optimization is central to these techniques, and several polyhedral schedulers, such as Feautrier, Pluto, isl, and Tensor Scheduler, have been proposed, each targeting a different architecture, parallelism model, or application scenario. The need for scenario-specific optimization is growing due to the heterogeneity of architectures. One of the most critical cases is NPUs (Neural Processing Units) used for AI, which may require loop optimization with different objectives. Another factor to consider is the framework or compiler in which polyhedral optimization takes place. Different scenarios, depending on the target architecture, compilation environment, and application domain, may require different kinds of optimization to best exploit the architecture feature set. We introduce a new configurable polyhedral scheduler, PolyTOPS, that can be adjusted to various scenarios with straightforward, high-level configurations. This scheduler allows the creation of diverse scheduling strategies that can be both scenario-specific (like state-of-the-art schedulers) and kernel-specific, breaking with the one-size-fits-all scheduler approach. PolyTOPS has been used with isl and CLooG as code generators and has been integrated into the MindSpore AKG deep learning compiler. Experimental results in different scenarios show good performance: a geomean speedup of 7.66x over isl scheduling on MindSpore hybrid custom operators (for the Ascend NPU architecture), and a geomean speedup of up to 1.80x over Pluto scheduling on PolyBench across different multicore architectures. Finally, some comparisons with different state-of-the-art tools are presented in the PolyMage scenario.
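The results above are reported as geomean (geometric mean) speedups. As a minimal sketch of how such a figure is typically computed, the snippet below takes per-kernel run times for a baseline and an optimized schedule and averages the speedup ratios in log space. The function name `geomean_speedup` and the timings are hypothetical, chosen for illustration only; they are not taken from the paper or its measurements.

```python
import math


def geomean_speedup(baseline_times, optimized_times):
    """Geometric mean of per-kernel speedups (baseline time / optimized time)."""
    if not baseline_times or len(baseline_times) != len(optimized_times):
        raise ValueError("need two non-empty timing lists of equal length")
    speedups = [b / o for b, o in zip(baseline_times, optimized_times)]
    # Average the logarithms, then exponentiate: this is the n-th root of the
    # product of the ratios, computed without overflowing on long lists.
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# Hypothetical timings in seconds for three kernels (illustrative values only).
baseline = [4.0, 3.0, 10.0]
optimized = [1.0, 3.0, 5.0]
print(f"geomean speedup: {geomean_speedup(baseline, optimized):.2f}x")
```

The geometric mean is the usual choice for summarizing ratios because a single kernel with a very large speedup skews it less than it would an arithmetic mean.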
