SAKURAONE is a managed high-performance computing (HPC) cluster developed and operated by the SAKURA Internet Research Center. It builds on the KOKARYOKU PHY bare-metal GPU platform and is optimized for advanced workloads, including large language model (LLM) training. In the ISC 2025 TOP500 list, SAKURAONE ranks 49th by HPL and is the only top-100 system that uses a fully open networking stack (800 GbE with SONiC), demonstrating the scalability of vendor-neutral technology. Measured performance is 33.95 PFLOP/s on HPL (Rmax), 396.295 TFLOP/s on HPCG, and 339.86 PFLOP/s on HPL-MxP with FP8. The system consists of 100 nodes, each with eight NVIDIA H100 GPUs, together with a 2 PB all-flash Lustre file system; the nodes are interconnected via a rail-optimized 800 GbE leaf-spine fabric with RoCEv2. Because the cluster was used exclusively by a single research project, we could observe the characteristics of development-related jobs. Consistent with previous HPC studies, small-scale jobs dominated in number, while a few large-scale jobs accounted for most of the GPU resource time. As the project progressed, resource use shifted from large-scale to mid-scale jobs, reflecting a transition from initial large-scale training to iterative refinement. These observations illustrate the real-world utilization dynamics of GPU clusters under unified project workloads.
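The headline figures above imply a few derived quantities that are easy to check by hand. The following is a minimal sketch of that arithmetic, not a result reported by the authors. It assumes that each benchmark ran on all 100 nodes (800 GPUs), which the abstract does not state; the function name `derived_metrics` and the dictionary keys are hypothetical.

```python
# Figures quoted in the abstract.
NODES = 100
GPUS_PER_NODE = 8
HPL_RMAX_PFLOPS = 33.95    # HPL Rmax, PFLOP/s
HPCG_TFLOPS = 396.295      # HPCG, TFLOP/s
HPL_MXP_PFLOPS = 339.86    # HPL-MxP with FP8, PFLOP/s


def derived_metrics():
    """Derive per-GPU and ratio figures from the quoted numbers.

    Assumption (not stated in the abstract): every benchmark used
    all NODES * GPUS_PER_NODE GPUs.
    """
    total_gpus = NODES * GPUS_PER_NODE
    hpl_tflops = HPL_RMAX_PFLOPS * 1000.0  # convert PFLOP/s to TFLOP/s
    return {
        "total_gpus": total_gpus,
        # Average HPL throughput per GPU, in TFLOP/s.
        "hpl_tflops_per_gpu": hpl_tflops / total_gpus,
        # How much faster the mixed-precision run is than HPL.
        "mxp_over_hpl": HPL_MXP_PFLOPS / HPL_RMAX_PFLOPS,
        # HPCG throughput as a fraction of HPL Rmax.
        "hpcg_over_hpl": HPCG_TFLOPS / hpl_tflops,
    }


if __name__ == "__main__":
    for name, value in derived_metrics().items():
        print(f"{name}: {value:.4f}")
```

Under the stated assumption, this gives 800 GPUs, roughly 42.4 TFLOP/s of HPL throughput per GPU, an HPL-MxP result about ten times the HPL Rmax, and an HPCG result a little over 1% of HPL Rmax.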