Training large language models (LLMs) requires substantial compute and energy. At the same time, renewable energy sources regularly produce more electricity than the grid can absorb, leading to curtailment: the deliberate reduction of clean generation that the grid cannot use. These curtailment windows represent an opportunity: if training is aligned with them, LLMs can be pretrained on electricity that is both clean and cheap. This technical report presents a system that performs full-parameter LLM training across geo-distributed GPU clusters during regional curtailment windows, elastically switching between local single-site training and federated multi-site synchronization as sites become available or unavailable. Our prototype trains a 561M-parameter transformer model across three clusters using the Flower federated learning framework, with curtailment periods derived from real-world marginal carbon intensity traces. Preliminary results show that curtailment-aware scheduling preserves training quality while reducing operational emissions to 5-12% of single-site baselines.
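For illustration, below is a minimal sketch of the elastic scheduling decision described above. It assumes a site counts as "curtailed" when its marginal carbon intensity falls below a fixed threshold; the site names, threshold value, and pause fallback are hypothetical assumptions, not the report's actual policy, and the sketch omits the Flower-based training itself.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumed rule: a site is in a curtailment window when its marginal carbon
# intensity (gCO2/kWh) drops below this threshold. The threshold is illustrative.
CURTAILMENT_THRESHOLD_G_PER_KWH = 20.0


@dataclass
class SiteStatus:
    name: str
    marginal_carbon_intensity: float  # gCO2/kWh for the current interval


def curtailed_sites(statuses: List[SiteStatus]) -> List[str]:
    """Return the sites whose current marginal carbon intensity indicates curtailment."""
    return [
        s.name
        for s in statuses
        if s.marginal_carbon_intensity < CURTAILMENT_THRESHOLD_G_PER_KWH
    ]


def schedule_interval(statuses: List[SiteStatus]) -> Dict[str, str]:
    """Pick the training mode for the next interval.

    - no curtailed sites: pause training (assumed fallback)
    - one curtailed site: local single-site training, no cross-site sync
    - two or more:        federated multi-site round across the curtailed sites
    """
    active = curtailed_sites(statuses)
    if not active:
        return {"mode": "pause", "sites": ""}
    if len(active) == 1:
        return {"mode": "local", "sites": active[0]}
    return {"mode": "federated", "sites": ",".join(active)}


if __name__ == "__main__":
    snapshot = [
        SiteStatus("site-a", 5.0),    # deep curtailment
        SiteStatus("site-b", 12.0),   # curtailment
        SiteStatus("site-c", 310.0),  # running on grid power, not curtailed
    ]
    print(schedule_interval(snapshot))
    # -> {'mode': 'federated', 'sites': 'site-a,site-b'}
```

In the full system this decision would be re-evaluated as the carbon intensity traces advance, triggering the switch between local training and federated synchronization that the abstract describes.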