Growing demand for computation and storage has significantly increased the scale of the systems that power applications and services, raising concerns about sustainability and operational costs. In this paper, we explore power-saving techniques in high-performance computing (HPC) and datacenter networks, and their relationship to performance degradation. On this basis, we propose leveraging the Energy Efficient Ethernet (EEE) protocol, an approach that can extend to conventional Ethernet or to the upcoming Ethernet-derived versions of the BXI and Omnipath interconnects. We analyze the PerfBound power-saving mechanism, identify possible improvements, and model it in a simulation framework. Through a series of experiments, we examine its impact on performance and determine the most appropriate interconnect. We also study the traffic patterns generated by selected HPC and machine learning applications to evaluate how power-saving techniques behave under them. From these experiments, we analyze how applications affect system and network energy consumption. Based on this analysis, we expose the weaknesses of dynamic power-down mechanisms and propose an approach that improves energy savings with minimal or no performance penalty. To the best of our knowledge, this work presents the first thorough analysis of PerfBound and an enhancement to the technique, while also targeting emerging post-exascale networks.