Adaptive Bayesian computation for efficient biobank-scale genomic inference

Motivation: Modern biobanks, with unprecedented sample sizes and phenotypic diversity, have become foundational resources for genomic studies, enabling powerful cross-phenotype and population-scale analyses. As studies grow in complexity, Bayesian hierarchical models offer a principled framework for jointly modeling multiple units such as cells, traits, and experimental conditions, increasing statistical power through information sharing. However, adoption of Bayesian hierarchical models in biobank-scale studies remains limited due to computational inefficiencies, particularly in posterior inference over high-dimensional parameter spaces. Deterministic approximations such as variational inference provide scalable alternatives to Markov chain Monte Carlo, yet current implementations do not fully exploit the structure of genome-wide multi-unit modeling, especially when biological effects of interest are concentrated in a few units.

Results: We propose an adaptive focus (AF) strategy within a block coordinate ascent variational inference (CAVI) framework that selectively updates subsets of parameters at each iteration, corresponding to units deemed relevant based on current estimates. We illustrate this approach in protein quantitative trait locus (pQTL) mapping using a joint model of hierarchically linked regressions with shared parameters across traits. In both simulated data and real proteomic data from the UK Biobank, AF-CAVI achieves up to a 50% reduction in runtime while maintaining statistical performance. We also provide a genome-wide pipeline for multi-trait pQTL mapping across thousands of traits, demonstrating AF-CAVI as an efficient scheme for large-scale, multi-unit Bayesian analysis in biobanks.
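The adaptive focus idea can be illustrated with a minimal sketch. This is not the authors' implementation, and it rests on several simplifying assumptions. The model is reduced to independent Gaussian-prior regressions, one per trait, with known noise variance `sigma2` and prior variance `tau2`, so there are no shared parameters across traits. The mean-field family is q(beta[j, t]) = N(mu[j, t], s2[j]). The relevance rule (a trait is "in focus" when its largest |mu| / sd exceeds `z_thresh`) and the periodic full sweep every `full_every` iterations are hypothetical choices. The names `af_cavi`, `z_thresh`, and `full_every` are introduced here for illustration only.

```python
import numpy as np

def af_cavi(X, Y, sigma2=1.0, tau2=1.0, n_iter=50, full_every=5, z_thresh=3.0):
    """Toy adaptive-focus CAVI; blocks are traits (columns of Y).

    Returns the variational means mu (p x T), the variational variances
    s2 (p,), and the total number of trait-block updates performed.
    """
    n, p = X.shape
    T = Y.shape[1]
    xtx = np.einsum("ij,ij->j", X, X)        # x_j^T x_j for each predictor
    s2 = 1.0 / (xtx / sigma2 + 1.0 / tau2)   # closed form, fixed given sigma2 and tau2
    mu = np.zeros((p, T))
    R = Y.astype(float).copy()               # residuals Y - X @ mu (mu starts at zero)
    n_block_updates = 0

    for it in range(n_iter):
        if it % full_every == 0:
            # Periodic full sweep so that out-of-focus traits still get refreshed.
            active = np.arange(T)
        else:
            # Assumed relevance rule: keep traits whose current estimates
            # show at least one large standardized effect.
            z = np.abs(mu) / np.sqrt(s2)[:, None]
            active = np.flatnonzero(z.max(axis=0) > z_thresh)

        for j in range(p):
            xj = X[:, j]
            # Remove predictor j's contribution from the residuals of active traits.
            R[:, active] += np.outer(xj, mu[j, active])
            # Standard CAVI mean update for a Gaussian prior with known variances.
            mu[j, active] = (s2[j] / sigma2) * (xj @ R[:, active])
            # Put the updated contribution back.
            R[:, active] -= np.outer(xj, mu[j, active])

        n_block_updates += active.size

    return mu, s2, n_block_updates
```

Setting `full_every=1` recovers an ordinary full sweep at every iteration, which gives a baseline for comparing both the estimates and the number of block updates. In this toy model the traits are independent, so skipping a trait only delays its own convergence. In a joint model with shared parameters, such as the one described above, skipped units would also affect the shared updates, which this sketch does not capture.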
