Recent High-Performance Computing (HPC) systems face important challenges, such as massive power consumption coupled with significantly under-utilized system resources. Given the power consumption trends, future systems will be deployed in an over-provisioned manner, with more resources installed than can be powered simultaneously. In such a scenario, maximizing resource utilization and energy efficiency while staying within a given power constraint is pivotal. Driven by this observation, in this position paper we first highlight recent trends in resource management techniques, with a particular focus on malleability support (i.e., dynamically scaling resource allocations/requirements for a job), co-scheduling (i.e., co-locating multiple jobs within a node), and power management. Second, we consider combining these techniques, assess their relationships and synergies, and discuss the functionality required of each software component in future over-provisioned and power-constrained HPC systems. Third, we briefly introduce our ongoing efforts on the integration of software tools, which will ultimately lead to the convergence of malleability and power management, as designed in the HPC PowerStack initiative.