This paper addresses theoretical issues associated with probabilistic partial least squares (PLS) regression. As in factor analysis, probabilistic PLS regression with unique variance suffers from improper solutions and a lack of identifiability, both of which cause difficulties in interpreting latent variables and model parameters. Using the fact that probabilistic PLS regression can be viewed as a special case of factor analysis, we impose a norm constraint on the factor loading matrix, a prescription recently proposed in the context of factor analysis to avoid improper solutions. We then prove that probabilistic PLS regression with this norm constraint is identifiable. We apply the model to data on amino acid mutations in Human Immunodeficiency Virus (HIV) protease to demonstrate the validity of the norm constraint and to confirm identifiability numerically. The proposed constraint also enables visualization of the latent variables via a biplot. Finally, we investigate the sampling distribution of the maximum likelihood estimator (MLE) using synthetically generated data, and observe numerically that the MLE is consistent and asymptotically normally distributed.