Data complexity analysis quantifies the hardness of constructing a predictive model on a given dataset. However, existing data complexity measures can be undermined by the irrelevant features and feature interactions found in biological microarray data. We propose a novel data complexity measure, depth, that leverages an evolutionary-inspired feature selection algorithm to quantify the complexity of microarray data. By examining feature subsets of varying sizes, the approach offers a new perspective on data complexity analysis. Unlike traditional metrics, depth is robust to irrelevant features and effectively captures complexity stemming from feature interactions. On synthetic microarray data, depth outperforms existing methods both in robustness to irrelevant features and in identifying complexity arising from feature interactions. Applied to case-control genotype and gene-expression microarray datasets, depth reveals that a single feature of gene-expression data can account for over 90% of the performance of a multi-feature model, confirming the adequacy of the commonly used differentially expressed gene (DEG) feature selection method for gene-expression data. Our study also demonstrates that constructing predictive models is harder for genotype data than for gene-expression data. The results in this paper provide evidence for the use of interpretable machine learning algorithms on microarray data.