Compiler Optimization for Irregular Memory Access Patterns in PGAS Programs

Irregular memory access patterns pose performance and user productivity challenges on distributed-memory systems. They can lead to fine-grained remote communication, and the data access patterns are often not known until runtime. The Partitioned Global Address Space (PGAS) programming model addresses the productivity challenge by providing users with a view of a distributed-memory system that resembles a single shared address space. However, this view often leads programmers to write code that causes fine-grained remote communication, which can result in poor performance. Prior work has shown that the performance of irregular applications written in Chapel, a high-level PGAS language, can be improved by manually applying optimizations. However, applying such optimizations by hand reduces the productivity advantages provided by Chapel and the PGAS model. We present an inspector-executor-based compiler optimization for Chapel programs that automatically performs remote data replication. While similar compiler optimizations have been implemented for other PGAS languages, high-level features in Chapel such as implicit processor affinity lead to new challenges for compiler optimization. We evaluate the performance of our optimization across two irregular applications. Our results show that the total runtime can be improved by as much as 52x on a Cray XC system with a low-latency interconnect and 364x on a standard Linux cluster with an InfiniBand interconnect, demonstrating that significant performance gains can be achieved without sacrificing user productivity.
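To make the inspector-executor idea concrete, the following is a minimal sketch in Python rather than Chapel, and it is not the implementation described in this work. It assumes a toy block-distributed array in which each fine-grained remote read costs one message and each bulk transfer costs one message per remote locale contacted. All names (`BlockArray`, `inspect`, `execute`, `naive_sum`) are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


class BlockArray:
    """Toy block-distributed array that counts simulated remote messages."""

    def __init__(self, values, num_locales):
        self.values = list(values)
        self.num_locales = num_locales
        # Elements per locale, rounded up (ceiling division).
        self.block = -(-len(self.values) // num_locales)
        self.messages = 0

    def owner(self, i):
        """Return the locale that owns element i."""
        return i // self.block

    def get(self, i, here):
        """Fine-grained read: one message for every remote element."""
        if self.owner(i) != here:
            self.messages += 1
        return self.values[i]

    def bulk_get(self, indices, here):
        """Aggregated read: one message per remote locale contacted."""
        by_owner = defaultdict(list)
        for i in indices:
            by_owner[self.owner(i)].append(i)
        self.messages += sum(1 for o in by_owner if o != here)
        return {i: self.values[i] for i in indices}


def naive_sum(A, B, here):
    """Unoptimized irregular loop: sum of A[B[i]] with per-element reads."""
    return sum(A.get(b, here) for b in B)


def inspect(A, B, here):
    """Inspector: walk the access pattern and record the remote indices."""
    return sorted({b for b in B if A.owner(b) != here})


def execute(A, B, here, remote_indices):
    """Executor: replicate remote data once, then run the loop locally."""
    replica = A.bulk_get(remote_indices, here)
    total = 0
    for b in B:
        # Remote elements come from the replica; local ones are read directly.
        total += replica[b] if b in replica else A.get(b, here)
    return total
```

With eight elements over four locales and the index array `[7, 0, 7, 5, 7, 1, 5]` evaluated on locale 0, the naive loop in this toy model issues five remote messages, while the inspector-executor version issues two (one per remote locale) and computes the same sum. In general, the inspector's cost pays off when the same access pattern is executed repeatedly, since the recorded indices can be reused as long as the index array does not change.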
