This article considers the impact of different thresholding methods on the Nearest Shrunken Centroid algorithm, popularly referred to as Prediction Analysis of Microarrays (PAM), for high-dimensional classification. PAM uses soft thresholding to achieve high computational efficiency and high classification accuracy, but at the price of retaining too many features. When applied to human cancer microarray data, PAM selected an average of 2611 features across 10 multi-class datasets. Such a large number of features makes follow-up studies difficult. One reason behind this problem is soft thresholding itself, which is known to produce biased parameter estimates in regression analysis. In this article, we extend the PAM algorithm with two other thresholding methods, hard and order thresholding, and with a deep search algorithm that yields a better estimate of the thresholding parameter. The modified algorithms are extensively tested and compared with the original one on real data and in Monte Carlo studies. In general, the modifications not only gave better cancer status prediction accuracy, but also resulted in more parsimonious models with significantly fewer features.
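To make the contrast between soft and hard thresholding concrete, the following is a minimal sketch of the two textbook operators applied to a handful of values. The function names, the input values, and the threshold `delta` are hypothetical and chosen only for illustration; they are not taken from the article or from any PAM implementation. Order thresholding and the deep search algorithm are not sketched here, since the abstract does not define them.

```python
def soft_threshold(d, delta):
    """Shrink d toward zero by delta; values within delta of zero become 0."""
    magnitude = abs(d) - delta
    if magnitude <= 0:
        return 0.0
    return magnitude if d > 0 else -magnitude


def hard_threshold(d, delta):
    """Keep d unchanged if |d| exceeds delta; otherwise set it to 0."""
    return float(d) if abs(d) > delta else 0.0


# Hypothetical standardized centroid differences for five features.
diffs = [3.0, -2.5, 0.4, -0.2, 1.1]
delta = 1.0  # hypothetical threshold value

soft = [soft_threshold(d, delta) for d in diffs]
hard = [hard_threshold(d, delta) for d in diffs]

for d, s, h in zip(diffs, soft, hard):
    print(f"d={d:+.1f}  soft={s:+.2f}  hard={h:+.2f}")
```

In this sketch both operators zero out the same small entries, but soft thresholding also pulls every surviving value toward zero by `delta`, while hard thresholding leaves survivors untouched. That shrinkage of the retained values is the kind of bias the abstract attributes to soft thresholding.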