Accurately labeling biomedical data is challenging. Traditional semi-supervised learning methods often under-utilize available unlabeled data. To address this, we propose a novel reliability-based training data cleaning method that employs inductive conformal prediction (ICP). The method uses a small set of accurately labeled training data and ICP-calculated reliability metrics to rectify mislabeled data and outliers within large quantities of noisy training data. We validate its efficacy on three classification tasks in distinct modalities: filtering drug-induced-liver-injury (DILI) literature using titles and abstracts, predicting ICU admission of COVID-19 patients from CT radiomics and electronic health records, and subtyping breast cancer using RNA-sequencing data. Varying levels of noise were introduced into the training labels through label permutation. Results show significant improvements in classification performance: accuracy improved in 86 of 96 DILI experiments (by up to 11.4%), AUROC and AUPRC improved in all 48 COVID-19 experiments (by up to 23.8% and 69.8%, respectively), and accuracy and macro-average F1 score improved in 47 of 48 RNA-sequencing experiments (by up to 74.6% and 89.0%, respectively). Our method has the potential to substantially boost classification performance in multi-modal biomedical machine learning tasks, and it does so without requiring an excessive volume of meticulously curated training data.
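The cleaning idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the nonconformity measure, the thresholds, or the exact correction rule, so everything below is an assumption made for illustration: nonconformity is the distance to a class centroid fitted on a small clean set, p-values are computed per class against a clean calibration set, and a hypothetical threshold `low` decides whether a noisy sample is kept, relabeled, or dropped as an outlier. The data are synthetic.

```python
import numpy as np

def nonconformity(X, centroids):
    """Distance from each row of X to each class centroid (larger = less typical)."""
    return np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)

def icp_p_values(cal_scores, cal_labels, test_scores):
    """Class-conditional ICP p-values.

    cal_scores:  (n_cal,) nonconformity of each calibration sample for its true label
    test_scores: (n_test, n_classes) nonconformity of each test sample per candidate label
    """
    n_classes = test_scores.shape[1]
    p = np.empty(test_scores.shape, dtype=float)
    for c in range(n_classes):
        ref = cal_scores[cal_labels == c]
        # Count calibration scores at least as nonconforming as the test score.
        at_least = (ref[None, :] >= test_scores[:, c][:, None]).sum(axis=1)
        p[:, c] = (at_least + 1) / (len(ref) + 1)
    return p

def clean_labels(p, noisy_labels, low=0.1):
    """Hypothetical rule (an assumption, not the paper's stated procedure):
    keep a label whose p-value is at least `low`; otherwise relabel to the
    most plausible class if that class reaches `low`; otherwise drop the sample."""
    given_p = p[np.arange(len(noisy_labels)), noisy_labels]
    best = p.argmax(axis=1)
    best_p = p.max(axis=1)
    new_labels = np.where(given_p >= low, noisy_labels, best)
    keep = (given_p >= low) | (best_p >= low)
    return new_labels, keep

# Synthetic two-class data (illustrative only).
rng = np.random.default_rng(0)
centers = np.array([[-6.0, 0.0], [6.0, 0.0]])

def sample(n_per_class):
    y = np.repeat([0, 1], n_per_class)
    X = centers[y] + rng.normal(size=(len(y), 2))
    return X, y

X_train, y_train = sample(15)    # small clean set used to fit the centroids
X_cal, y_cal = sample(15)        # small clean calibration set
X_noisy, y_true = sample(100)    # large set whose labels will be corrupted
y_noisy = y_true.copy()
flip = rng.choice(len(y_true), size=60, replace=False)
y_noisy[flip] = 1 - y_noisy[flip]  # flip 30% of the labels

centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
cal_scores = nonconformity(X_cal, centroids)[np.arange(len(y_cal)), y_cal]
p = icp_p_values(cal_scores, y_cal, nonconformity(X_noisy, centroids))
y_clean, keep = clean_labels(p, y_noisy)

noise_before = np.mean(y_noisy != y_true)
noise_after = np.mean(y_clean[keep] != y_true[keep])
print(f"label noise before: {noise_before:.3f}, after: {noise_after:.3f}, kept: {keep.sum()}")
```

In this sketch the cleaned set `X_noisy[keep]`, `y_clean[keep]` would then be used to train the downstream classifier. A real pipeline would replace the centroid distance with a nonconformity score derived from the task's own model.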