Isambard-AI is a new leadership-class supercomputer designed to support AI-related research. Based on the HPE Cray EX4000 system, and housed in a new, energy-efficient Modular Data Centre in Bristol, UK, Isambard-AI employs 5,448 NVIDIA Grace Hopper GPUs to deliver over 21 ExaFLOP/s of 8-bit floating point performance for LLM training, and over 250 PetaFLOP/s of 64-bit performance, for under 5 MW. Isambard-AI integrates two all-flash storage systems: a 20 PiByte Cray ClusterStor and a 3.5 PiByte VAST solution. Combined, these give Isambard-AI the flexibility to support training, inference, and secure data access and sharing. But it is the software stack where Isambard-AI will differ most from traditional HPC systems. Isambard-AI is designed to support users who may have been using GPUs in the cloud, so access will more typically be via Jupyter notebooks, MLOps tooling, or other web-based, interactive interfaces, rather than the traditional supercomputing approach of SSHing into a system and submitting jobs to a batch scheduler. Its stack is designed to be quickly and regularly upgraded to keep pace with the rapid evolution of AI software, with full support for containers. Phase 1 of Isambard-AI is due online in May/June 2024, with the full system expected in production by the end of the year.