Temporal Coverage over Density: Parsimonious Training-Set Design for ML Climate Downscaling

High-resolution regional climate simulations provide critical information for climate impact assessments but remain computationally expensive, motivating the development of machine-learning downscalers and emulators. A key challenge is determining how a limited number of high-resolution simulation years should be distributed across a changing climate trajectory to capture both the forced climate response and internal variability. Using the CESM2 Large Ensemble over the western United States, we compare three training-year selection strategies under fixed data budgets: a contiguous block of historical years, years drawn from both the beginning and end of the simulation period, and years distributed throughout the full climate trajectory. Including both historical and future years consistently outperforms training on historical years alone, demonstrating the importance of exposing downscaling models to climate states outside the historical record and highlighting the limitations of the stationarity assumptions common in statistical downscaling. Training on years distributed throughout the full climate trajectory performs best overall, indicating that broad sampling of internal variability provides additional information beyond exposure to the forced climate response alone. Models trained on temporally distributed subsets reproduce variability in unseen ensemble members more successfully while retaining strong performance across a wide range of climate diagnostics. Even when trained on only one-tenth of the available high-resolution years, temporally distributed models remain highly competitive with full-data training. These results suggest that, under fixed computational budgets, broad sampling of climate states is more valuable than temporal continuity when allocating scarce high-resolution simulations. The findings provide practical guidance for regional climate downscaling and large-ensemble projection workflows.
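To make the three selection strategies described above concrete, the following is a minimal sketch of how training years might be chosen under a fixed budget. It is an illustration under stated assumptions, not the authors' implementation. The function name `select_training_years`, the strategy labels, the even split between early and late years, and the even spacing used for the distributed case are all hypothetical choices; the abstract does not specify how years are drawn within each strategy.

```python
def select_training_years(first_year, last_year, budget, strategy):
    """Pick `budget` training years from [first_year, last_year].

    Hypothetical helper: the strategy names and the exact selection
    rules are assumptions made for illustration only.
    """
    years = list(range(first_year, last_year + 1))
    if not 0 < budget <= len(years):
        raise ValueError("budget must be between 1 and the number of years")

    if strategy == "contiguous":
        # One block at the start of the period (historical years only).
        return years[:budget]

    if strategy == "bookend":
        # Split the budget between the start and the end of the period
        # (assumed near-even split; the early half gets any odd year).
        n_early = (budget + 1) // 2
        n_late = budget - n_early
        late = years[-n_late:] if n_late else []
        return years[:n_early] + late

    if strategy == "distributed":
        # Evenly spaced years across the whole trajectory (assumed;
        # random sampling without replacement would be an alternative).
        if budget == 1:
            return [years[len(years) // 2]]
        last_index = len(years) - 1
        return [years[i * last_index // (budget - 1)] for i in range(budget)]

    raise ValueError(f"unknown strategy: {strategy!r}")


if __name__ == "__main__":
    # Hypothetical 100-year period with a one-tenth data budget.
    for name in ("contiguous", "bookend", "distributed"):
        print(name, select_training_years(2001, 2100, 10, name))
```

All three calls return the same number of years, so any difference in downstream skill reflects where the years sit on the trajectory rather than how much data was used.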
