Bayesian approaches to clinical analysis for patient phenotyping have been limited by the computational challenges of applying Markov chain Monte Carlo (MCMC) methods to large real-world data. Approximate Bayesian inference via optimization of the variational evidence lower bound, often called Variational Bayes (VB), has been successfully demonstrated in other applications. We investigate the performance and characteristics of currently available R and Python VB software for variational Bayesian Latent Class Analysis (LCA) of realistically large real-world observational data. We used a real-world data set, Optum\textsuperscript{TM} electronic health records (EHR), covering pediatric patients with risk indicators for type 2 diabetes mellitus, a form of diabetes that is rare in pediatric patients. The aim of this work is to validate a Bayesian patient phenotyping model for generality and extensibility and, crucially, to show that it can be applied to a realistically large real-world clinical data set. We find that currently available automatic VB methods are very sensitive to initial starting conditions, model definition, algorithm hyperparameters and choice of gradient optimiser. The Bayesian LCA model was challenging to implement using VB, but we achieved reasonable results with very good computational performance compared to MCMC.
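For reference, the variational evidence lower bound (ELBO) is conventionally written as below. The notation is an assumption made here for illustration and is not taken from the paper: $x$ denotes the observed data, $z$ the unobserved quantities (for LCA, the latent class memberships and model parameters), and $q(z)$ the variational approximation to the posterior.

\[
\log p(x) \;\ge\; \mathbb{E}_{q(z)}\big[\log p(x, z)\big] - \mathbb{E}_{q(z)}\big[\log q(z)\big] \;=\; \mathrm{ELBO}(q),
\qquad
\log p(x) - \mathrm{ELBO}(q) \;=\; \mathrm{KL}\big(q(z)\,\|\,p(z \mid x)\big).
\]

Since $\log p(x)$ does not depend on $q$, maximizing the ELBO over $q$ is equivalent to minimizing the Kullback-Leibler divergence from $q(z)$ to the exact posterior $p(z \mid x)$, which is how VB replaces sampling with an optimization problem.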