An efficient Bayesian approach to joint functional principal component analysis for complex sampling designs

The analysis of multivariate functional curves has the potential to yield important scientific discoveries in domains such as healthcare, medicine, economics and the social sciences. However, it is common for real-world settings to present data that are both sparse and irregularly sampled, and this introduces important challenges for current functional data methodology. Here we propose a Bayesian hierarchical framework for multivariate functional principal component analysis which accommodates the intricacies of such sampling designs by flexibly pooling information across subjects and correlated curves. Our model represents common latent dynamics via shared functional principal component scores, thereby effectively borrowing strength across curves while circumventing the computationally challenging task of estimating covariance matrices. These scores also provide a parsimonious representation of the major modes of joint variation of the curves, and constitute interpretable scalar summaries that can be employed in follow-up analyses. We perform inference using a variational message passing algorithm which combines efficiency, modularity and approximate posterior density estimation, enabling the joint analysis of large datasets with parameter uncertainty quantification. We conduct detailed simulations to assess the effectiveness of our approach in sharing information under complex sampling designs. We also exploit it to estimate the molecular disease courses of individual patients with SARS-CoV-2 infection and characterise patient heterogeneity in recovery outcomes; this study reveals key coordinated dynamics across the immune, inflammatory and metabolic systems, which are associated with survival and long-COVID symptoms up to one year post disease onset. Our approach is implemented in the R package bayesFPCA.
