The impact of internal variability on benchmarking deep learning climate emulators

Full-complexity Earth system models (ESMs) are computationally expensive, which limits their use in exploring the climate outcomes of multiple emission pathways. More efficient emulators that approximate ESMs can map emissions directly onto climate outcomes, and benchmarks are used to evaluate their accuracy on standardized tasks and datasets. We investigate ClimateBench, a popular benchmark in data-driven climate emulation on which deep learning-based emulators currently achieve the best performance. We compare these deep learning emulators with a linear regression-based emulator, akin to pattern scaling, and show that it outperforms the incumbent 100M-parameter deep learning foundation model, ClimaX, on 3 out of 4 regionally resolved climate variables, notably surface temperature and precipitation. While the emulation of surface temperature is expected to be predominantly linear, this result is surprising for precipitation. Precipitation is a much noisier variable, and we show that deep learning emulators can overfit to internal variability noise at low frequencies, which degrades their performance relative to a linear emulator. We address the overfitting by increasing the number of climate simulations per emission pathway (from 3 to 50) and updating the benchmark targets with the respective ensemble averages from the MPI-ESM1.2-LR model. Using the new targets, we show that linear pattern scaling continues to be more accurate on temperature, but can be outperformed by a deep learning-based technique for emulating precipitation. We publish our code and data at github.com/blutjens/climate-emulator.
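As a rough illustration of the two ideas above, the sketch below fits one least-squares line per grid cell (pattern scaling in its conventional form) and compares how much internal-variability noise remains in 3-member versus 50-member ensemble-mean targets. It is a minimal sketch on synthetic data, not the code from the repository. The function names, the grid size, the Gaussian noise model, and the use of global mean temperature as the single predictor are all assumptions made for illustration.

```python
import numpy as np


def fit_pattern_scaling(global_t, local_field):
    """Fit local = slope * global_t + intercept independently per grid cell.

    global_t:    (n_years,) global mean temperature anomaly (assumed predictor)
    local_field: (n_years, n_lat, n_lon) target variable on a grid
    """
    n_years = global_t.shape[0]
    design = np.column_stack([global_t, np.ones(n_years)])  # (n_years, 2)
    flat = local_field.reshape(n_years, -1)                 # (n_years, n_cells)
    coef, *_ = np.linalg.lstsq(design, flat, rcond=None)    # (2, n_cells)
    grid_shape = local_field.shape[1:]
    return coef[0].reshape(grid_shape), coef[1].reshape(grid_shape)


def predict_pattern_scaling(global_t, slope, intercept):
    """Broadcast the per-cell line over all years."""
    return global_t[:, None, None] * slope + intercept


def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))


# Synthetic setup (hypothetical numbers, chosen only for illustration).
rng = np.random.default_rng(0)
n_years, n_lat, n_lon = 80, 4, 8
global_t = np.linspace(0.0, 3.0, n_years)
true_slope = rng.uniform(0.5, 2.0, size=(n_lat, n_lon))
forced = global_t[:, None, None] * true_slope  # noise-free forced response


def make_ensemble(n_members, noise_std=1.0):
    """Forced response plus independent Gaussian noise per member.

    White noise is a simplification: real internal variability is
    correlated in space and time.
    """
    noise = rng.normal(0.0, noise_std, size=(n_members, n_years, n_lat, n_lon))
    return forced + noise


# Ensemble-mean targets built from 3 and from 50 members.
target_3 = make_ensemble(3).mean(axis=0)
target_50 = make_ensemble(50).mean(axis=0)

# Residual noise left in each target relative to the forced response.
noise_3 = rmse(target_3, forced)
noise_50 = rmse(target_50, forced)

# Fit the linear emulator on the 50-member target and evaluate it.
slope, intercept = fit_pattern_scaling(global_t, target_50)
prediction = predict_pattern_scaling(global_t, slope, intercept)
emulator_error = rmse(prediction, forced)

print(f"residual noise, 3-member mean:  {noise_3:.3f}")
print(f"residual noise, 50-member mean: {noise_50:.3f}")
print(f"linear emulator error vs forced response: {emulator_error:.3f}")
```

For independent noise, the residual noise in an ensemble mean shrinks roughly with the square root of the number of members, so the 50-member target is a much cleaner estimate of the forced response than the 3-member one. A two-parameter-per-cell linear fit has little capacity to fit whatever noise remains, which is one plausible reading of why it holds up well against noisy targets.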
