Selecting the best subset of variables is a challenging problem in supervised and unsupervised learning, especially in high-dimensional settings where the number of variables typically far exceeds the number of observations. In this paper, we focus on two multivariate statistical methods: principal component analysis and partial least squares. Both are popular linear dimension-reduction methods with numerous applications in fields including genomics, biology, environmental science, and engineering. In particular, these approaches construct principal components, new variables that are linear combinations of all the original variables. A main drawback of principal components is that they are difficult to interpret when the number of variables is large. To define principal components from only the most relevant variables, we propose to cast the best subset solution path method into the principal component analysis and partial least squares frameworks. We offer a new alternative by exploiting a continuous optimization algorithm to compute the best subset solution path. Empirical studies demonstrate the efficacy of our approach at providing the best subset solution path. The use of our algorithm is further illustrated through the analysis of two real datasets: the first is analyzed using principal component analysis, while the analysis of the second is based on the partial least squares framework.