Here, we test the performance and scalability of fully-asynchronous, best-effort communication on existing, commercially-available HPC hardware. A first set of experiments tested whether best-effort communication strategies can benefit performance compared to the traditional perfect communication model. At high CPU counts, best-effort communication improved both the number of computational steps executed per unit time and the solution quality achieved within a fixed-duration run window. Under the best-effort model, characterizing the distribution of quality of service across processing components and over time is critical to understanding the actual computation being performed. Additionally, a complete picture of scalability under the best-effort model requires analysis of how such quality of service fares at scale. To address both needs, we designed and measured a suite of quality of service metrics: simulation update period, message latency, message delivery failure rate, and message delivery coagulation. Under a lower communication-intensivity benchmark parameterization, we found that median values for all quality of service metrics were stable when scaling from 64 to 256 processes. Under maximal communication intensivity, we found only minor (and, in most cases, nil) degradation in median quality of service. In an additional set of experiments, we tested the effect of an apparently faulty compute node on performance and quality of service. Despite extreme quality of service degradation on that node and within its clique, median performance and quality of service remained stable.
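As a rough illustration of how quality of service metrics of this kind might be derived from per-process counters and then summarized by their median, consider the sketch below. The formulas are assumptions made for illustration only; the abstract names the metrics but does not define them. All function and parameter names are hypothetical. Message latency is left out because it depends on how timestamps are taken across processes.

```python
from statistics import median


def update_period(elapsed_seconds, num_updates):
    """Assumed definition: wall-clock seconds per simulation update."""
    return elapsed_seconds / num_updates


def delivery_failure_rate(num_sent, num_delivered):
    """Assumed definition: fraction of sent messages that never arrived."""
    if num_sent == 0:
        return 0.0
    return (num_sent - num_delivered) / num_sent


def delivery_coagulation(num_received, num_pulls, num_laden_pulls):
    """Assumed definition: how strongly arrivals bunch into few receive calls.

    A "laden" pull is a receive attempt that returned at least one message.
    If messages arrived as evenly as possible, the number of laden pulls
    would equal min(num_received, num_pulls), giving a score of 0.0.
    """
    best_case = min(num_received, num_pulls)
    if best_case == 0:
        return 0.0
    return 1.0 - num_laden_pulls / best_case


def median_by_metric(snapshots):
    """Take the median of each metric across per-process snapshots."""
    keys = snapshots[0].keys()
    return {key: median(s[key] for s in snapshots) for key in keys}


# Hypothetical counters from three processes, one of them badly degraded.
snapshots = [
    {"failure_rate": delivery_failure_rate(100, 100),
     "coagulation": delivery_coagulation(100, 100, 100)},
    {"failure_rate": delivery_failure_rate(100, 100),
     "coagulation": delivery_coagulation(100, 100, 100)},
    {"failure_rate": delivery_failure_rate(100, 50),
     "coagulation": delivery_coagulation(50, 100, 10)},
]
summary = median_by_metric(snapshots)
```

With these made-up counters, the single degraded process does not move the median of either metric, which is the sense in which a median summary can stay stable while one node and its neighbors degrade sharply.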