We propose a new algebraic topological framework that extracts intrinsic information from MALDI data and transforms it to reflect the topological persistence in the data. The framework has two main advantages. First, topological persistence helps to distinguish signal from noise. Second, it compresses the MALDI data, which saves storage space and reduces the computational time of subsequent classification tasks. We introduce an algorithm that implements the framework and depends on a single tuning parameter, and we show that it is computationally efficient. After the persistence extraction, logistic regression and random forest classifiers are fitted to the resulting persistence transformation diagrams to assign each observational unit to one of two classes describing the lung cancer subtypes. Finally, we apply the proposed framework to a real-world MALDI data set and illustrate the competitiveness of the methods via cross-validation.
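The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, under the assumption that persistence is computed as 0-dimensional superlevel-set persistence of a one-dimensional intensity vector and that the single tuning parameter is a persistence threshold. The function names `peak_persistence` and `significant_peaks` and the toy spectrum are hypothetical and are not taken from the paper. Each local maximum is paired with the intensity level at which its component merges into a component with a higher peak; peaks whose persistence (birth minus death) falls below the threshold are treated as noise and dropped, which also compresses the spectrum to a short list of pairs.

```python
def peak_persistence(signal):
    """0-dimensional superlevel-set persistence of a 1D signal.

    Returns a list of (peak_index, birth, death) triples, one per local
    maximum. This is a generic union-find sketch, not the paper's algorithm.
    """
    n = len(signal)
    if n == 0:
        return []
    parent = [-1] * n   # -1 marks samples not yet added to the filtration
    peak = {}           # component root -> index of its highest sample

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    pairs = []
    # Sweep the intensity level downwards: highest samples enter first.
    order = sorted(range(n), key=lambda i: -signal[i])
    for i in order:
        parent[i] = i
        peak[i] = i
        for j in (i - 1, i + 1):
            if 0 <= j < n and parent[j] != -1:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                # Elder rule: the component with the lower peak dies.
                if signal[peak[ri]] <= signal[peak[rj]]:
                    ri, rj = rj, ri
                dying = peak[rj]
                if dying != i:  # skip trivial zero-persistence pairs
                    pairs.append((dying, signal[dying], signal[i]))
                parent[rj] = ri
    # The global maximum never merges; close it at the global minimum.
    root = find(order[0])
    pairs.append((peak[root], signal[peak[root]], min(signal)))
    return pairs


def significant_peaks(signal, threshold):
    """Keep only peaks whose persistence (birth - death) is >= threshold."""
    kept = [p for p in peak_persistence(signal) if p[1] - p[2] >= threshold]
    return sorted(kept)


# Toy spectrum (made-up values, for illustration only).
spectrum = [0, 3, 1, 5, 2, 4, 0]
print(significant_peaks(spectrum, threshold=3))
```

In a pipeline like the one described above, the retained (birth, death) pairs would then be turned into a fixed-length feature vector per spectrum and passed to the classifiers; how that vectorization is done is not stated in the abstract and is left open here.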