ChaosBench: A Multi-Channel, Physics-Based Benchmark for Subseasonal-to-Seasonal Climate Prediction

Accurate climate prediction on the subseasonal-to-seasonal (S2S) scale is crucial for disaster preparedness and robust decision making amidst climate change. Yet forecasting beyond the weather timescale is challenging because it involves problems beyond initial conditions, including boundary interactions, the butterfly effect, and our inherent lack of physical understanding. Existing benchmarks tend to have a shorter forecasting range of up to 15 days, do not include a wide range of operational baselines, and lack physics-based constraints for explainability. Thus, we propose ChaosBench, a challenging benchmark to extend the predictability range of data-driven weather emulators to the S2S timescale. First, ChaosBench comprises variables beyond the typical surface-atmospheric ERA5 to also include ocean, ice, and land reanalysis products that span over 45 years, allowing for full Earth system emulation that respects boundary conditions. In addition to deterministic and probabilistic metrics, we propose physics-based metrics to ensure a physically consistent ensemble that accounts for the butterfly effect. Furthermore, we evaluate a diverse set of physics-based forecasts from four national weather agencies as baselines for data-driven counterparts such as ViT/ClimaX, PanguWeather, GraphCast, and FourCastNetV2. Overall, we find that methods originally developed for weather-scale applications fail on the S2S task: their performance simply collapses to an unskilled climatology. Nonetheless, we outline and demonstrate several strategies that can extend the predictability range of existing weather emulators, including the use of ensembles, robust control of error propagation, and the use of physics-informed models. Our benchmark, datasets, and instructions are available at https://leap-stc.github.io/ChaosBench.
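To make the notion of a forecast "collapsing to an unskilled climatology" concrete, the following is a minimal sketch of an RMSE-based skill score measured against a climatological reference. It is an illustration under stated assumptions, not the ChaosBench implementation: the function names `rmse` and `skill_vs_climatology` are hypothetical, the numbers are made-up toy values rather than benchmark data, and the benchmark's own metrics may be defined differently.

```python
import math


def rmse(pred, truth):
    """Root-mean-square error between two equal-length sequences."""
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same length")
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))


def skill_vs_climatology(forecast, climatology, truth):
    """RMSE skill score relative to climatology (hypothetical helper).

    1.0 means a perfect forecast, 0.0 means no better than climatology,
    and negative values mean worse than climatology.
    """
    return 1.0 - rmse(forecast, truth) / rmse(climatology, truth)


# Toy values for illustration only (assumed, not taken from ChaosBench).
truth = [2.0, -1.0, 0.5, 1.5]         # observed anomalies
climatology = [0.0, 0.0, 0.0, 0.0]    # climatological reference (zero anomaly)
short_lead = [1.8, -0.9, 0.6, 1.4]    # forecast that still tracks the truth
long_lead = [0.0, 0.0, 0.0, 0.0]      # forecast that has reverted to climatology

skill_short = skill_vs_climatology(short_lead, climatology, truth)
skill_long = skill_vs_climatology(long_lead, climatology, truth)

print(f"short-lead skill: {skill_short:.3f}")
print(f"long-lead skill:  {skill_long:.3f}")
```

In this toy setting, the short-lead forecast retains most of its skill, while the forecast that has reverted to the climatological mean scores exactly zero, which is the sense in which such a forecast is "unskilled".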
