Weather and climate models use parametrisations to represent unresolved sub-grid processes. Traditional schemes rely on fixed coefficients that are weakly constrained and tuned offline, which contributes to persistent biases and limits the schemes' ability to adapt to the underlying physics. This study presents a framework that uses reinforcement learning (RL) to learn components of parametrisation schemes online as a function of the evolving model state. RL-driven parameter updates are evaluated across idealised testbeds spanning simple climate bias correction (SCBC), radiative-convective equilibrium (RCE), and a zonal-mean energy balance model (EBM), in single-agent and federated multi-agent settings. Of the nine RL algorithms tested, Truncated Quantile Critics (TQC), Deep Deterministic Policy Gradient (DDPG), and Twin Delayed DDPG (TD3) achieved the highest skill and converged stably. Performance was assessed against a static baseline using area-weighted RMSE together with temperature and pressure-level diagnostics. For the EBM, single-agent RL outperformed static parameter tuning, with the strongest gains in the tropical and mid-latitude bands. Federated RL in multi-agent setups enabled specialised control and faster convergence, and a six-agent DDPG configuration with frequent aggregation yielded the lowest area-weighted RMSE across the tropics and mid-latitudes. The learnt corrections were also physically meaningful: agents modulated EBM radiative parameters to reduce meridional biases, adjusted RCE lapse rates to match vertical temperature errors, and stabilised heating increments to limit drift. Overall, the results show that RL can learn skilful, state-dependent parametrisation components in idealised settings, offering a scalable pathway for online learning within numerical models and a starting point for evaluation in weather and climate models.
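Because area-weighted RMSE is the headline metric for the EBM results, a minimal sketch of one common way to compute it on a zonal-mean latitude grid may be useful. It assumes cosine-of-latitude weights, the standard approximation for the relative area of latitude bands. The function name `area_weighted_rmse`, the 18-band grid, and the temperature profile are hypothetical illustrations and are not taken from the study.

```python
import numpy as np

def area_weighted_rmse(pred, ref, lat_deg):
    """RMSE over latitude bands, each weighted by cos(latitude).

    Assumption: band area is taken as proportional to cos(latitude);
    the study's exact weighting is not specified in the abstract.
    """
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    weights = np.cos(np.deg2rad(np.asarray(lat_deg, dtype=float)))
    weights = weights / weights.sum()  # normalise so the weights sum to 1
    return float(np.sqrt(np.sum(weights * (pred - ref) ** 2)))

# Hypothetical 18-band grid with band centres from 85S to 85N.
lat = np.linspace(-85.0, 85.0, 18)
# Hypothetical reference temperature profile (K), warm tropics and cold poles.
ref = 288.0 - 40.0 * np.sin(np.deg2rad(lat)) ** 2
# A prediction with a uniform 1 K warm bias in every band.
pred = ref + 1.0

print(area_weighted_rmse(pred, ref, lat))
```

With these weights, an error of a given size in a tropical band raises the score more than the same error in a polar band, so the metric emphasises the regions that cover most of the globe's surface.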