Improvement of variables interpretability in kernel PCA

Kernel methods have proven to be a powerful tool for integrating and analysing data generated by high-throughput technologies. Kernels yield a nonlinear version of any linear algorithm that depends on the data only through dot products. The kernelized version of Principal Component Analysis (kernel PCA, KPCA) is a valid way to handle the nonlinearity of biological sample spaces. This paper proposes a novel methodology to obtain a data-driven feature importance based on the KPCA representation of the data. The proposed method, kernel PCA Interpretable Gradient (KPCA-IG), provides a data-driven feature importance that is computationally fast and based solely on linear algebra calculations. It has been compared with existing methods on three benchmark datasets. The accuracy obtained using the features selected by KPCA-IG is equal to or greater than the average of the other methods. The computational complexity required also demonstrates the high efficiency of the method. An exhaustive literature search has been conducted on the genes selected from a publicly available hepatocellular carcinoma dataset to validate the retained features from a biological point of view. The results again support the appropriateness of the computed ranking. The black-box nature of kernel PCA calls for new methods to interpret the original features. The proposed methodology, KPCA-IG, proved to be a valid alternative for selecting influential variables in high-dimensional high-throughput datasets, potentially unravelling new biological and medical biomarkers.
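To make the "solely based on dot products" idea concrete, the following is a minimal sketch of the kernel PCA step only, written in plain Python. It is not the authors' KPCA-IG feature ranking, which the abstract does not specify in detail. The Gaussian (RBF) kernel, the `gamma` value, the toy two-cluster data, the use of power iteration, and all function names are assumptions made for illustration.

```python
import math
import random


def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel; an assumed choice, any valid kernel would do."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def centered_gram(X, gamma):
    """Gram matrix K[i][j] = k(x_i, x_j), double-centered in feature space."""
    n = len(X)
    K = [[rbf_kernel(X[i], X[j], gamma) for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in K]  # equals column means: K is symmetric
    total_mean = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - row_mean[j] + total_mean
             for j in range(n)] for i in range(n)]


def leading_eigenpair(Kc, iters=500, seed=0):
    """Power iteration: approximates the top eigenpair of the centered Gram matrix."""
    n = len(Kc)
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        w = [sum(Kc[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(t * t for t in w))
        v = [t / norm for t in w]
    # Rayleigh quotient of the unit vector v
    eigval = sum(v[i] * sum(Kc[i][j] * v[j] for j in range(n)) for i in range(n))
    return eigval, v


def kpca_first_scores(X, gamma=1.0):
    """Scores of the training points on the first kernel principal component.

    With alpha = v / sqrt(eigval), the score of point i is
    sum_j alpha_j * Kc[i][j] = sqrt(eigval) * v[i].
    """
    Kc = centered_gram(X, gamma)
    eigval, v = leading_eigenpair(Kc)
    return [math.sqrt(eigval) * t for t in v], eigval


# Toy data (hypothetical): two small 2-D clusters.
data_rng = random.Random(42)
X = [[data_rng.gauss(0.0, 1.0), data_rng.gauss(0.0, 1.0)] for _ in range(6)]
X += [[data_rng.gauss(4.0, 1.0), data_rng.gauss(4.0, 1.0)] for _ in range(6)]

scores, eigval = kpca_first_scores(X, gamma=0.5)
```

The original variables enter only through `rbf_kernel`; everything afterwards operates on the n-by-n Gram matrix. This is why the principal components of kernel PCA are not directly expressed in terms of the input features, which is the interpretability gap the abstract refers to. For real data, a full eigendecomposition from a numerical library would replace the power iteration used here.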


