Growing demand for computation and storage has significantly increased the scale of the systems that power applications and services, raising concerns about sustainability and operational costs. In this paper, we explore power-saving techniques in high-performance computing (HPC) and datacenter networks and their relationship to performance degradation. On this basis, we propose leveraging the Energy Efficient Ethernet (EEE) protocol, with the flexibility to extend to conventional Ethernet or to upcoming Ethernet-derived versions of the BXI and Omnipath interconnects. We analyze the PerfBound power-saving mechanism, identify possible improvements, and model it in a simulation framework. Through a series of experiments, we examine its impact on performance and determine the most appropriate interconnect. We also study the traffic patterns generated by selected HPC and machine learning applications to evaluate the behavior of power-saving techniques. From these experiments, we analyze how applications affect system and network energy consumption. Based on this analysis, we expose the weakness of dynamic power-down mechanisms and propose an approach that improves energy savings with minimal or no performance penalty. To the best of our knowledge, this work presents the first thorough analysis of PerfBound and an enhancement to the technique, while also targeting emerging post-exascale networks.