Vision Transformers for Multi-Variable Climate Downscaling: Emulating Regional Climate Models with a Shared Encoder and Multi-Decoder Architecture

Global Climate Models (GCMs) are critical for simulating large-scale climate dynamics, but their coarse spatial resolution limits their applicability in regional studies. Regional Climate Models (RCMs) address this limitation through dynamical downscaling, albeit at considerable computational cost and with limited flexibility. Deep learning has emerged as an efficient data-driven alternative; however, most existing approaches rely on single-variable models that downscale one variable at a time. This paradigm can lead to redundant computation, limited contextual awareness, and weak cross-variable interactions. To address these limitations, we propose a multi-variable Vision Transformer (ViT) architecture with a shared encoder and variable-specific decoders (1EMD). The proposed model jointly predicts six key climate variables directly from GCM-resolution inputs, emulating RCM-scale downscaling over Europe: surface temperature, wind speed, 500 hPa geopotential height, total precipitation, surface downwelling shortwave radiation, and surface downwelling longwave radiation. Compared to single-variable ViT models, the 1EMD architecture improves performance across all six variables, achieving an average MSE reduction of approximately 5.5% under a fair and controlled comparison. It also consistently outperforms alternative multi-variable baselines, including a single-decoder ViT and a multi-variable U-Net. Moreover, multi-variable models substantially reduce computational cost, yielding a 29 to 32% lower inference time per variable compared to single-variable approaches. Overall, our results demonstrate that multi-variable modeling provides systematic advantages for high-resolution climate downscaling in terms of both accuracy and efficiency. Among the evaluated architectures, the proposed 1EMD ViT achieves the most favorable trade-off between predictive performance and computational cost.
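
To make the shared-encoder, multi-decoder idea concrete, the following is a minimal forward-pass sketch in NumPy, not the implementation evaluated in this work. The abstract does not specify the architecture's details, so every specific here is an assumption made for illustration: the class name `SharedEncoderMultiDecoder`, the short variable names, the patch size, embedding width, upscaling factor, the single attention layer with one head, the ReLU feed-forward block, the omission of positional embeddings, and the use of one linear head per variable as a stand-in for each variable-specific decoder. The point it illustrates is only that the encoder runs once per input and its tokens are reused by all six decoders.

```python
import numpy as np

# Hypothetical short names for the six target variables listed in the abstract.
VARIABLES = ["tas", "wind", "z500", "pr", "rsds", "rlds"]


def patchify(x, p):
    """Split a (C, H, W) field into (N, C*p*p) non-overlapping patch vectors."""
    c, h, w = x.shape
    x = x.reshape(c, h // p, p, w // p, p).transpose(1, 3, 0, 2, 4)
    return x.reshape((h // p) * (w // p), c * p * p)


def unpatchify(tokens, grid_h, grid_w, p):
    """Reassemble (N, p*p) single-channel patches into a (grid_h*p, grid_w*p) map."""
    x = tokens.reshape(grid_h, grid_w, p, p).transpose(0, 2, 1, 3)
    return x.reshape(grid_h * p, grid_w * p)


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)


class SharedEncoderMultiDecoder:
    """Illustrative stand-in: one shared encoder, one linear decoder head per variable."""

    def __init__(self, in_ch, patch, dim, scale, rng):
        self.patch, self.scale = patch, scale
        init = lambda *shape: rng.normal(0.0, 0.02, shape)
        self.embed = init(in_ch * patch * patch, dim)          # patch embedding
        self.wq, self.wk, self.wv, self.wo = (init(dim, dim) for _ in range(4))
        self.w1, self.w2 = init(dim, 4 * dim), init(4 * dim, dim)
        out_patch = patch * scale                               # high-resolution patch side
        self.heads = {v: init(dim, out_patch * out_patch) for v in VARIABLES}

    def encode(self, x):
        """Shared encoder: patch embedding plus one pre-norm Transformer block."""
        t = patchify(x, self.patch) @ self.embed
        h = layer_norm(t)
        q, k, v = h @ self.wq, h @ self.wk, h @ self.wv
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        t = t + (attn @ v) @ self.wo                            # attention + residual
        t = t + np.maximum(layer_norm(t) @ self.w1, 0.0) @ self.w2  # MLP + residual
        return t

    def forward(self, x):
        _, h, w = x.shape
        gh, gw = h // self.patch, w // self.patch
        tokens = self.encode(x)  # computed once, shared by every decoder head
        return {
            name: unpatchify(tokens @ head, gh, gw, self.patch * self.scale)
            for name, head in self.heads.items()
        }
```

Under these assumptions, calling `SharedEncoderMultiDecoder(in_ch=6, patch=4, dim=32, scale=2, rng=np.random.default_rng(0)).forward(x)` on a coarse input `x` of shape `(6, 16, 16)` returns a dictionary with one `(32, 32)` field per variable. A single-variable setup would instead run a separate encoder for each output, which is the redundant computation the multi-variable design is meant to avoid.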
