Optimizing PCA for Health and Care Research: A Reliable Approach to Component Selection

Principal component analysis (PCA) is widely used in health and care research to analyze complex high-dimensional (HD) datasets such as patient health records, genetic data, and medical imaging. By reducing dimensionality, PCA helps identify key patterns and trends, which can aid disease diagnosis, treatment optimization, and the discovery of new biomarkers. The primary goal of any dimensionality reduction technique, however, is to reduce the dimensionality of a dataset while retaining its essential information and variability. In practice, the number of principal components (PCs) to retain is chosen with rules such as the Kaiser-Guttman criterion, Cattell's scree test, and the percentage of cumulative variance approach. Unfortunately, these methods can give very different results, so an inappropriate choice of method may lead to misinterpreted and inaccurate results in PCA and PCA-related health and care research applications. The disagreement becomes even more pronounced in HD settings where the number of observations n is smaller than the number of variables p (n < p), which makes it all the more important to determine the best approach. It is therefore necessary to identify the weaknesses of the different techniques for selecting the number of PCs to retain. The Kaiser-Guttman criterion retains fewer PCs, causing overdispersion, while Cattell's scree test retains more PCs, compromising reliability. The percentage of cumulative variance criterion offers greater stability, consistently selecting the optimal number of components. Consequently, the Pareto chart, which shows both the cumulative percentage and the cut-off point for retained PCs, provides the most reliable method of selecting components, ensuring stability and enhancing the effectiveness of PCA, particularly in health-related research applications.
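To make the disagreement between selection rules concrete, the following is a minimal sketch of two of the criteria named above, applied to a made-up eigenvalue spectrum. The eigenvalues, the 80% threshold, and the function names are illustrative assumptions, not values or code from the paper. Cattell's scree test is left out because it relies on visual inspection of the eigenvalue plot.

```python
import numpy as np

def kaiser_guttman(eigenvalues):
    """Number of components with eigenvalue > 1 (assumes PCA on a correlation matrix)."""
    return int(np.sum(np.asarray(eigenvalues, dtype=float) > 1.0))

def cumulative_variance(eigenvalues, threshold=0.80):
    """Smallest k whose leading k components explain at least `threshold` of the variance.

    Also returns the cumulative proportions, which are the values a
    Pareto chart would draw as its cumulative line.
    """
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # largest first
    cum = np.cumsum(ev) / ev.sum()
    k = int(np.searchsorted(cum, threshold)) + 1
    return min(k, len(ev)), cum  # guard against rounding at threshold = 1.0

# Hypothetical eigenvalues, chosen only for illustration.
eigenvalues = np.array([3.0, 1.5, 0.9, 0.4, 0.2])

k_kaiser = kaiser_guttman(eigenvalues)
k_cum, cum = cumulative_variance(eigenvalues, threshold=0.80)

print("Kaiser-Guttman retains:", k_kaiser)
print("Cumulative variance (80%) retains:", k_cum)
print("Cumulative percentages:", np.round(100 * cum, 1))
```

On this toy spectrum the two rules disagree on how many components to keep, which is the kind of discrepancy the abstract describes. The threshold is a tunable choice; the abstract does not state which cut-off the authors use.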


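PCA

In statistics, principal component analysis (PCA) is a method for projecting data from a higher-dimensional space onto a lower-dimensional one while preserving as much variance as possible along each retained dimension. Given a set of points in two, three, or more dimensions, a "best-fitting" line can be defined as the line that minimizes the average squared distance from the points to the line. The next best-fitting line is chosen in the same way from the directions perpendicular to the first. Repeating this process yields an orthogonal basis in which the individual dimensions of the data are uncorrelated. These basis vectors are called principal components.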
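The construction described above can be sketched with a singular value decomposition of the centered data matrix. This is a minimal illustration on randomly generated data; the data, the random seed, and the variable names are assumptions made for the example.

```python
import numpy as np

# Synthetic data: 50 observations of 4 correlated variables (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))

# Center each variable, then decompose the centered data.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

components = Vt                      # rows are the principal directions
scores = Xc @ Vt.T                   # data expressed in the new orthogonal basis
explained_variance = s**2 / (X.shape[0] - 1)

# The directions are orthonormal and the scores are mutually uncorrelated:
# their covariance matrix is diagonal, with the explained variances on the diagonal.
cov_scores = np.cov(scores, rowvar=False)
print(np.round(components @ components.T, 6))
print(np.round(cov_scores, 6))
```

The singular values come back in descending order, so the first row of `components` is the direction of greatest variance, the second is the best direction orthogonal to it, and so on.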
