Canonical correlation analysis (CCA) is a popular statistical technique for exploring relationships between datasets. In recent years, the estimation of sparse canonical vectors has emerged as an important but challenging variant of the CCA problem, with widespread applications. Unfortunately, existing rate-optimal estimators for sparse canonical vectors have high computational cost. We propose a quasi-Bayesian estimation procedure that not only achieves the minimax estimation rate, but is also easy to compute by Markov chain Monte Carlo (MCMC). The method builds on Tan et al. (2018) and uses a re-scaled Rayleigh quotient function as the quasi-log-likelihood. However, unlike Tan et al. (2018), we adopt a Bayesian framework that combines this quasi-log-likelihood with a spike-and-slab prior to regularize the inference and promote sparsity. We investigate the empirical behavior of the proposed method on both continuous and truncated data, and we demonstrate that it outperforms several state-of-the-art methods. As an application, we use the proposed methodology to maximally correlate clinical variables and proteomic data in order to better understand COVID-19.
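To make the Rayleigh quotient connection concrete, the sketch below uses the standard (non-sparse) formulation of CCA as a generalized eigenproblem: with A holding the cross-covariance blocks and B the block-diagonal within-set covariances, the leading canonical pair maximizes θᵀAθ / θᵀBθ, and the maximum equals the first canonical correlation. This is only an illustration under stated assumptions. The re-scaling used in the quasi-log-likelihood, the spike-and-slab prior, and the MCMC sampler are not reproduced here, and all function and variable names (`rayleigh_quotient`, `leading_canonical_pair`, the simulated data) are hypothetical.

```python
import numpy as np


def rayleigh_quotient(theta, A, B):
    """Generalized Rayleigh quotient theta' A theta / theta' B theta."""
    return float(theta @ A @ theta) / float(theta @ B @ theta)


def inv_sqrt(S):
    """Inverse symmetric square root of a positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T


def leading_canonical_pair(Sx, Sy, Sxy):
    """Classical CCA via the SVD of the whitened cross-covariance."""
    Sx_is, Sy_is = inv_sqrt(Sx), inv_sqrt(Sy)
    U, s, Vt = np.linalg.svd(Sx_is @ Sxy @ Sy_is)
    a = Sx_is @ U[:, 0]   # canonical vector for X
    b = Sy_is @ Vt[0]     # canonical vector for Y
    return a, b, s[0]     # s[0] is the first canonical correlation


# Simulated data (an assumption for illustration): X and Y share one
# latent signal z, entering only through their first coordinates.
rng = np.random.default_rng(0)
n, p, q = 500, 6, 4
z = rng.standard_normal(n)
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, q))
X[:, 0] += z
Y[:, 0] += z

# Sample covariance blocks from centered data.
Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
Sx, Sy, Sxy = Xc.T @ Xc / n, Yc.T @ Yc / n, Xc.T @ Yc / n

# Block matrices of the generalized eigenproblem A theta = rho * B theta.
A = np.block([[np.zeros((p, p)), Sxy], [Sxy.T, np.zeros((q, q))]])
B = np.block([[Sx, np.zeros((p, q))], [np.zeros((q, p)), Sy]])

a, b, rho = leading_canonical_pair(Sx, Sy, Sxy)
theta = np.concatenate([a, b])

# The quotient at the stacked canonical pair equals rho, and no other
# direction attains a larger value.
quotient_at_optimum = rayleigh_quotient(theta, A, B)
quotient_at_random = rayleigh_quotient(rng.standard_normal(p + q), A, B)
```

In this dense setting the maximizer is available in closed form. A sparse variant would instead treat a function of this quotient as a quasi-log-likelihood in θ and place a sparsity-inducing prior on θ, which is the role the abstract assigns to the spike-and-slab prior.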