In environmental studies, realistic simulations are essential for understanding complex systems. Statistical emulation with Gaussian processes (GPs) in functional data models has become a standard tool for this purpose. Traditional centralized processing of such models requires substantial computational and storage resources, which has motivated distributed Bayesian learning algorithms that partition the data into shards for distributed computation. However, this raises concerns about the sensitivity of distributed inference to the choice of shards. Instead of using data shards, our approach employs multiple random matrices to create random linear projections, or sketches, of the dataset. Posterior inference on functional data models is conducted on the random data sketches in parallel on different machines, and the individual inferences are then combined at a central server. Aggregating inference across random matrices makes our approach resilient to the selection of data sketches, resulting in robust distributed Bayesian learning. An important advantage of the approach is that it maintains the privacy of sampling units, since the random sketches prevent recovery of the raw data. We illustrate the approach through simulation examples and evaluate its performance as an emulator using surrogates of the Sea, Lake, and Overland Surges from Hurricanes (SLOSH) simulator, an important simulator for government agencies.
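To make the sketch-then-combine workflow concrete, the following is a minimal illustration under strong simplifying assumptions, not the method of the paper. It replaces the GP-based functional data model with a conjugate Bayesian linear regression, uses Gaussian sketching matrices, and combines the per-machine posteriors by simple averaging; the abstract does not specify the combination rule, so that choice is an assumption. All function names and parameter values (`sketch_posterior`, `combine`, `noise_var`, `prior_var`) are hypothetical.

```python
import numpy as np


def sketch_posterior(X, y, m, seed, noise_var=1.0, prior_var=10.0):
    """Posterior of regression coefficients computed from one random sketch.

    Assumption: a linear model stands in for the functional data model.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Gaussian sketching matrix (m x n, m << n), scaled so E[S^T S] = I_n.
    S = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
    Xs, ys = S @ X, S @ y  # only these m-row sketches leave the data owner
    # Conjugate Gaussian update with prior beta ~ N(0, prior_var * I).
    precision = Xs.T @ Xs / noise_var + np.eye(p) / prior_var
    cov = np.linalg.inv(precision)
    mean = cov @ (Xs.T @ ys) / noise_var
    return mean, cov


def combine(posteriors):
    """Central-server step: average the per-sketch posterior summaries.

    Assumption: plain averaging, chosen only for illustration.
    """
    means = np.stack([mean for mean, _ in posteriors])
    covs = np.stack([cov for _, cov in posteriors])
    return means.mean(axis=0), covs.mean(axis=0)


# Synthetic data standing in for simulator output.
rng = np.random.default_rng(0)
n, p = 2000, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Each "machine" k draws its own sketching matrix (here run sequentially).
posteriors = [sketch_posterior(X, y, m=200, seed=k) for k in range(8)]
beta_hat, cov_hat = combine(posteriors)
print(beta_hat)
```

Because each sketch has far fewer rows than the original data (200 versus 2000 here), the sketching matrix is not invertible, which is the intuition behind the privacy claim; averaging over several independent sketches reduces the dependence of the result on any single random matrix.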