Kernel methods have proven to be a powerful tool for the integration and analysis of data generated by high-throughput technologies. Kernels offer a nonlinear version of any linear algorithm that relies solely on dot products. Kernel Principal Component Analysis (KPCA), the kernelized version of PCA, is a valid nonlinear alternative for handling the nonlinearity of biological sample spaces. This paper proposes a novel methodology to obtain a data-driven feature importance based on the KPCA representation of the data. The proposed method, kernel PCA Interpretable Gradient (KPCA-IG), provides a data-driven feature importance that is computationally fast and based solely on linear algebra calculations. It has been compared with existing methods on three benchmark datasets. The accuracy obtained using the features selected by KPCA-IG is equal to or greater than the average of the other methods. Its computational complexity also demonstrates the high efficiency of the method. An exhaustive literature search was conducted on the genes selected from a publicly available hepatocellular carcinoma dataset to validate the retained features from a biological point of view. The results again support the appropriateness of the computed ranking. The black-box nature of kernel PCA calls for new methods to interpret the original features. Our proposed methodology, KPCA-IG, proved to be a valid alternative for selecting influential variables in high-dimensional high-throughput datasets, potentially unravelling new biological and medical biomarkers.
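To make the gradient idea behind such a feature ranking concrete, the following is a minimal NumPy sketch. It is not the authors' KPCA-IG implementation. It assumes a Gaussian (RBF) kernel and scores each feature by the norm of the gradient of the leading kernel PCA projections with respect to that feature, averaged over the samples. The function names, the `gamma` value, the aggregation rule, and the toy data are all illustrative assumptions.

```python
import numpy as np


def rbf_kernel(X, gamma):
    """Gaussian (RBF) kernel matrix: K[m, i] = exp(-gamma * ||x_m - x_i||^2)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def kpca_gradient_importance(X, gamma=0.5, n_components=2):
    """Hypothetical helper: gradient-based feature scores from an RBF kernel PCA.

    Returns one non-negative score per feature (larger = more influential).
    """
    n, d = X.shape
    K = rbf_kernel(X, gamma)

    # Double-centre the kernel matrix (centring in the implicit feature space).
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H

    # eigh returns eigenvalues in ascending order; keep the largest ones.
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:n_components]
    lam, V = eigvals[order], eigvecs[:, order]

    # Expansion coefficients of each component, shape (n, n_components).
    # Their columns sum to (numerically) zero, so the centring terms do not
    # contribute to the gradient below.
    A = V / np.sqrt(lam)

    # Gradient of the j-th projection evaluated at training point x_m:
    #   -2 * gamma * sum_i A[i, j] * K[m, i] * (x_m - x_i)
    grads = np.empty((n, d, n_components))
    for j in range(n_components):
        M = K * A[:, j]  # M[m, i] = K[m, i] * A[i, j]
        grads[:, :, j] = -2.0 * gamma * (
            X * M.sum(axis=1, keepdims=True) - M @ X
        )

    # Assumed aggregation: norm over components, then mean over samples.
    return np.linalg.norm(grads, axis=2).mean(axis=0)


# Toy data (an assumption for illustration): two concentric circles in
# features 0 and 1, plus four near-constant noise features.
rng = np.random.default_rng(0)
n = 100
angles = rng.uniform(0.0, 2.0 * np.pi, size=n)
radii = np.where(np.arange(n) < n // 2, 1.0, 3.0)
signal = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
noise = 0.05 * rng.standard_normal((n, 4))
X = np.hstack([signal, noise])

importance = kpca_gradient_importance(X, gamma=0.5, n_components=2)
ranking = np.argsort(importance)[::-1]  # feature indices, most influential first
print(ranking)
```

The sketch uses only a kernel matrix, one symmetric eigendecomposition, and matrix products, which is consistent with the claim that the ranking can be obtained from linear algebra alone. Because the score depends on feature scale, real data would normally be scaled consistently before such a ranking is interpreted.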