An Engineering Journey Training Large Language Models at Scale on Alps: The Apertus Experience

Jonathan Coles, Stefano Schuppli, Lukas Drescher, Fawzi Roberto Mohamed, Elia Palme, Henrique Mendonça, Miguel Gila, Mark Klein, Maxime Martinasso, Joost VandeVondele, Torsten Hoefler, Thomas Schulthess, Josh Romero, Igor Gorodetsky, Ryan Hankins, Isa Wazirzada, Martin Jaggi, Antoine Bosselut, Imanol Schlag, Antoni-Joan Solergibert i Llaquet, Alejandro Hernández Cano, Theofilos Ioannis Manitaras, Nicholas John Browning

Large Language Models (LLMs) have emerged as a transformative technology for science and society, prompting governments worldwide to pursue sovereign AI capabilities that ensure data compliance and cultural representation. However, the capital costs and engineering complexity of training these models have largely restricted such capabilities to the private sector, leaving a significant gap for public institutions. This paper details the engineering journey behind training Apertus, a fully open multilingual foundation model, on the Alps supercomputer. In a first-of-its-kind achievement for academia at the 70B-parameter scale, we successfully ran a massive pre-training campaign on one of Europe's largest systems for open science, powered by NVIDIA GH200 Grace Hopper Superchips. We describe the challenges encountered in readying HPC infrastructure for training AI models, from overcoming storage bottlenecks to stabilizing large-scale interconnects, and the lessons learned in transforming a supercomputer into a resilient, software-defined Machine Learning Platform. Finally, we discuss the post-training requirements and the evolution of our Machine Learning Platform, outlining how this initial release lays the groundwork for a sustained, iterative operational capability, in particular for fine-tuning foundation models, that extends well beyond a single model training run.
