Exploring Performance-Productivity Trade-offs in AMT Runtimes: A Task Bench Study of Itoyori, ItoyoriFBC, HPX, and MPI

Asynchronous Many-Task (AMT) runtimes offer a productive alternative to the Message Passing Interface (MPI). However, the diverse AMT landscape makes fair comparisons challenging. Task Bench, proposed by Slaughter et al., addresses this challenge through a parameterized framework for evaluating parallel programming systems. This work integrates two recent cluster AMTs, Itoyori and ItoyoriFBC, into Task Bench for comprehensive evaluation against MPI and HPX. Itoyori employs a Partitioned Global Address Space (PGAS) model with RDMA-based work stealing, while ItoyoriFBC extends it with future-based synchronization. We evaluate these systems in terms of both performance and programmer productivity. Performance is assessed across various configurations, including compute-bound kernels, weak scaling, and both imbalanced and communication-intensive patterns. It is quantified using application efficiency, i.e., the percentage of maximum performance achieved, and the Minimum Effective Task Granularity (METG), i.e., the smallest task duration before runtime overheads dominate. Programmer productivity is quantified using Lines of Code (LOC) and the Number of Library Constructs (NLC). Our results reveal distinct trade-offs. MPI achieves the highest efficiency for regular, communication-light workloads but requires verbose, low-level code. HPX maintains stable efficiency under load imbalance across varying node counts, yet ranks last in productivity metrics, demonstrating that AMTs do not inherently guarantee improved productivity over MPI. Itoyori achieves the highest efficiency in communication-intensive configurations while leading in programmer productivity. ItoyoriFBC exhibits slightly lower efficiency than Itoyori, though its future-based synchronization offers potential for expressing irregular workloads.
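To make the two performance metrics concrete, the following is a minimal sketch of how application efficiency and METG could be derived from a granularity sweep. It is not the Task Bench implementation. The function names, the sample data, and the 50% efficiency threshold are assumptions made for illustration; the threshold follows the METG(50%) convention commonly used with Task Bench.

```python
from typing import Iterable, Optional, Tuple


def efficiency(achieved: float, peak: float) -> float:
    """Application efficiency: fraction of maximum performance achieved."""
    if peak <= 0:
        raise ValueError("peak performance must be positive")
    return achieved / peak


def metg(samples: Iterable[Tuple[float, float]],
         threshold: float = 0.5) -> Optional[float]:
    """Smallest task granularity (seconds) that still meets the threshold.

    `samples` holds hypothetical (task_granularity_seconds, efficiency)
    pairs from a sweep. The sweep is walked from coarse to fine tasks and
    stops at the first point where efficiency falls below `threshold`
    (0.5 is an assumed default, not a value stated in the abstract).
    """
    result = None
    for granularity, eff in sorted(samples, key=lambda s: s[0], reverse=True):
        if eff < threshold:
            break  # runtime overheads dominate from here on
        result = granularity
    return result


# Hypothetical sweep: efficiency drops as tasks get shorter.
sweep = [(1e-3, 0.98), (1e-4, 0.90), (1e-5, 0.55), (1e-6, 0.20)]
print(metg(sweep))
```

A lower METG indicates that a system tolerates finer-grained tasks before its overheads dominate. This sketch does not interpolate between measured points, so the result is always one of the sampled granularities.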
