SAKURAONE is a managed high-performance computing (HPC) cluster developed and operated by the SAKURA Internet Research Center. It builds on the KOKARYOKU PHY bare-metal GPU platform and is optimized for advanced workloads, including large language model (LLM) training. In the ISC 2025 TOP500 list, SAKURAONE ranks 49th by HPL and is the only top-100 system that uses a fully open networking stack (800 GbE with SONiC), demonstrating the scalability of vendor-neutral technology. Measured performance is 33.95 PFLOP/s on HPL (Rmax), 396.295 TFLOP/s on HPCG, and 339.86 PFLOP/s on HPL-MxP with FP8. The system consists of 100 nodes, each with eight NVIDIA H100 GPUs, together with a 2 PB all-flash Lustre file system; all are interconnected via a rail-optimized 800 GbE leaf-spine fabric with RoCEv2. Because the cluster was used exclusively by a single research project, we were able to observe the characteristics of development-related jobs. Consistent with previous HPC studies, small-scale jobs dominated in number, while a few large-scale jobs accounted for most of the GPU resource time. As the project progressed, resource use shifted from large-scale to mid-scale jobs, reflecting a transition from initial large-scale training to iterative refinement. These observations illustrate the real-world utilization dynamics of GPU clusters under unified project workloads.
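
As a quick sanity check on the figures above, the following minimal Python sketch derives the total GPU count and a few ratios from the numbers quoted in the abstract. The variable and function names are illustrative. The per-GPU value is a plain average that assumes all GPUs took part in the benchmark run, which the abstract does not state; it is not a measurement reported by the authors.

```python
# Figures quoted in the abstract (units noted per variable).
NODES = 100
GPUS_PER_NODE = 8
HPL_RMAX_PFLOPS = 33.95      # HPL Rmax, PFLOP/s
HPCG_TFLOPS = 396.295        # HPCG, TFLOP/s
HPL_MXP_PFLOPS = 339.86      # HPL-MxP with FP8, PFLOP/s


def derived_figures():
    """Return simple derived quantities.

    Assumption: every GPU participated in each benchmark run, so the
    per-GPU figure is an average, not a measured per-device value.
    """
    total_gpus = NODES * GPUS_PER_NODE
    hpl_tflops = HPL_RMAX_PFLOPS * 1000  # convert PFLOP/s to TFLOP/s
    return {
        "total_gpus": total_gpus,
        "hpl_tflops_per_gpu": hpl_tflops / total_gpus,
        "mxp_speedup_over_hpl": HPL_MXP_PFLOPS / HPL_RMAX_PFLOPS,
        "hpcg_fraction_of_hpl": HPCG_TFLOPS / hpl_tflops,
    }


if __name__ == "__main__":
    for name, value in derived_figures().items():
        print(f"{name}: {value:.4g}")
```

Under these assumptions the sketch gives 800 GPUs, roughly 42.4 TFLOP/s of HPL throughput per GPU on average, an HPL-MxP result about 10 times the HPL result, and an HPCG result at about 1.2% of HPL Rmax.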