Kernel methods have proven to be a powerful tool for the integration and analysis of data generated by high-throughput technologies. Kernels offer a nonlinear version of any linear algorithm that relies solely on dot products. The kernelized version of Principal Component Analysis (kernel PCA, KPCA) is a valid nonlinear alternative for tackling the nonlinearity of biological sample spaces. This paper proposes a novel methodology for obtaining a data-driven feature importance based on the KPCA representation of the data. The proposed method, kernel PCA Interpretable Gradient (KPCA-IG), provides a data-driven feature importance that is computationally fast and based solely on linear algebra calculations. It has been compared with existing methods on three benchmark datasets. The accuracy obtained using the features selected by KPCA-IG is equal to or greater than the average of the other methods. The computational cost it requires also demonstrates the high efficiency of the method. An exhaustive literature search has been conducted on the genes selected from a publicly available hepatocellular carcinoma dataset to validate the retained features from a biological point of view. The results again support the appropriateness of the computed ranking. The black-box nature of kernel PCA calls for new methods to interpret the original features. The proposed KPCA-IG methodology proved to be a valid alternative for selecting influential variables in high-dimensional high-throughput datasets, potentially unravelling new biological and medical biomarkers.
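The abstract does not spell out how the gradient-based importance is computed, so the following is only a minimal sketch of the general idea and should not be read as the authors' KPCA-IG algorithm. It assumes an RBF kernel, differentiates the kernel PCA projection with respect to each input feature at every sample, and averages the gradient magnitudes to obtain a ranking. The function names `rbf_kernel` and `kpca_gradient_importance`, the default `gamma = 1 / n_features`, and the choice of aggregating by the norm over components are all illustrative assumptions.

```python
import numpy as np


def rbf_kernel(X, gamma):
    """Pairwise RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))


def kpca_gradient_importance(X, gamma=None, n_components=2):
    """Illustrative gradient-based feature scores on an RBF kernel PCA.

    Hypothetical helper, not the published method. Returns one
    non-negative score per column of X (larger = more influential).
    """
    n, p = X.shape
    if gamma is None:
        gamma = 1.0 / p  # assumed default bandwidth

    # Kernel matrix and its double-centered version.
    K = rbf_kernel(X, gamma)
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H

    # Leading eigenpairs of the centered kernel matrix.
    eigvals, eigvecs = np.linalg.eigh(Kc)
    top = np.argsort(eigvals)[::-1][:n_components]
    lam, V = eigvals[top], eigvecs[:, top]

    # Expansion coefficients giving unit-norm axes in feature space.
    A = V / np.sqrt(lam)
    # Centering the kernel shifts the coefficients by their column mean
    # when the projection is differentiated with respect to x.
    A = A - A.mean(axis=0)

    scores = np.zeros(p)
    for s in range(n):
        # Gradient of k(x, x_i) at x = X[s]: -2 * gamma * (x - x_i) * k(x, x_i)
        G = -2.0 * gamma * (X[s] - X) * K[s][:, None]  # shape (n, p)
        J = A.T @ G  # Jacobian of the projection, shape (n_components, p)
        scores += np.linalg.norm(J, axis=0)
    return scores / n
```

Calling `kpca_gradient_importance(X)` on a samples-by-features array returns a score vector that can be sorted with `np.argsort(scores)[::-1]` to rank the features. Every step is a matrix product or an eigendecomposition, which matches the abstract's description of a procedure based solely on linear algebra, although the exact scoring rule here is an assumption.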