The ARCHER2 service, a CPU-based HPE Cray EX system with 750,080 cores (5,860 nodes), was deployed throughout 2020 and 2021 and entered full service in December 2021. A key part of the work during this deployment was the integration of ARCHER2 into our local monitoring systems. As ARCHER2 was one of the very first large-scale EX deployments, this involved close collaboration and development work with the HPE team during a global pandemic, when collaboration and co-working were significantly more challenging than usual. The deployment included the creation of automated checks and visual representations of system status, which needed to be made available to external parties for diagnosis and interpretation. We describe how these checks were deployed and how the data gathered played a key role in the deployment of ARCHER2, the commissioning of the plant infrastructure, the conduct of HPL runs for submission to the Top500, and the contractual monitoring of the availability of the ARCHER2 service during its commissioning and early life.