Sampling Parallelism for Fast and Efficient Bayesian Learning

Machine learning models, and deep neural networks in particular, are increasingly deployed in risk-sensitive domains such as healthcare, environmental forecasting, and finance, where reliable quantification of predictive uncertainty is essential. However, many uncertainty quantification (UQ) methods remain difficult to apply because of their substantial computational cost. Sampling-based Bayesian learning approaches, such as Bayesian neural networks (BNNs), are particularly expensive, since drawing and evaluating multiple parameter samples rapidly exhausts memory and compute resources. These constraints have so far limited the accessibility and exploration of Bayesian techniques. To address these challenges, we introduce sampling parallelism, a simple yet powerful parallelization strategy that targets the primary bottleneck of sampling-based Bayesian learning: the samples themselves. By distributing sample evaluations across multiple GPUs, our method reduces memory pressure and training time without requiring architectural changes or extensive hyperparameter tuning. We detail the methodology and evaluate its performance on a few example tasks and architectures, using distributed data parallelism (DDP) as a baseline. We further demonstrate that sampling parallelism is complementary to existing strategies by implementing a hybrid approach that combines sample and data parallelism. Our experiments show near-perfect scaling when the number of samples grows in proportion to the computational resources, confirming that sample evaluations parallelize cleanly. Although DDP achieves better raw speedups when the total workload is held constant, sampling parallelism has a notable advantage: by applying independent stochastic augmentations to the same batch on each GPU, it increases augmentation diversity and thus reduces the number of epochs required for convergence.
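To make the core idea concrete, the following is a minimal sketch of how sample evaluations might be split across workers. It is not the authors' implementation. It assumes a toy one-parameter model with a Gaussian variational posterior, and it simulates the workers in a plain Python loop instead of on GPUs. All names (partition_samples, sample_weight, local_loss_sum, sample_parallel_loss) are hypothetical. The point it illustrates is that every worker sees the same batch but only its own shard of parameter samples, and that summing the per-worker partial results (the role an all-reduce would play in a real distributed setup) recovers the same Monte Carlo estimate as a single device evaluating all samples.

```python
import random


def partition_samples(num_samples, world_size):
    """Split sample indices 0..num_samples-1 into contiguous per-worker shards."""
    base, extra = divmod(num_samples, world_size)
    shards, start = [], 0
    for rank in range(world_size):
        size = base + (1 if rank < extra else 0)  # first `extra` ranks get one more
        shards.append(list(range(start, start + size)))
        start += size
    return shards


def sample_weight(mu, sigma, sample_id, seed=0):
    """Draw w = mu + sigma * eps, seeded per sample so the draw does not
    depend on which worker evaluates it (an assumption of this sketch)."""
    rng = random.Random(seed * 1_000_003 + sample_id)
    return mu + sigma * rng.gauss(0.0, 1.0)


def local_loss_sum(shard, mu, sigma, batch):
    """Sum of per-sample mean squared errors for one worker's shard.
    Every worker receives the SAME batch; only the samples differ."""
    xs, ys = batch
    total = 0.0
    for sample_id in shard:
        w = sample_weight(mu, sigma, sample_id)
        total += sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return total


def sample_parallel_loss(num_samples, world_size, mu, sigma, batch):
    """Monte Carlo loss estimate with samples sharded over `world_size` workers."""
    shards = partition_samples(num_samples, world_size)
    # In a real setup each entry would be computed concurrently on its own GPU.
    partial_sums = [local_loss_sum(shard, mu, sigma, batch) for shard in shards]
    # Stand-in for an all-reduce (sum) followed by division by the sample count.
    return sum(partial_sums) / num_samples


if __name__ == "__main__":
    batch = ([0.5, 1.0, 1.5, 2.0], [1.0, 2.0, 3.0, 4.0])
    single = sample_parallel_loss(8, 1, mu=1.8, sigma=0.3, batch=batch)
    sharded = sample_parallel_loss(8, 4, mu=1.8, sigma=0.3, batch=batch)
    print(single, sharded)  # equal up to floating-point rounding
```

Because the loss is an average over samples, gradients with respect to the variational parameters are also averages over samples, so the same reduction would apply to them. Each worker only needs to hold its own shard of samples in memory, which is the source of the reduced memory pressure described above.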
