Accurate classification of cancer-related medical abstracts is crucial for healthcare management and research. However, large labeled datasets are difficult to obtain in the medical domain because of privacy concerns and the complexity of clinical data, and this scarcity of annotated data impedes the development of effective machine learning models for cancer document classification. To address this challenge, we present a curated dataset of 1,874 biomedical abstracts categorized into four classes: thyroid cancer, colon cancer, lung cancer, and generic topics. Our research focuses on leveraging this dataset to improve classification performance, particularly in data-scarce scenarios. We introduce a Residual Graph Attention Network (R-GAT) with multiple graph attention layers that capture the semantic information and structural relationships within cancer-related documents. We compare R-GAT with transformer-based models such as Bidirectional Encoder Representations from Transformers (BERT) and RoBERTa, and with domain-specific models such as BioBERT and Bio+ClinicalBERT. We also evaluate deep learning models (CNNs, LSTMs) and traditional machine learning models (Logistic Regression, SVM), and we explore ensemble approaches that combine deep learning models to enhance classification. We assess several feature extraction methods, including Term Frequency-Inverse Document Frequency (TF-IDF) with unigrams and bigrams, Word2Vec, and the tokenizers of BERT and RoBERTa. The R-GAT model outperforms the other techniques, achieving per-class precision, recall, and F1 scores of 0.99, 0.97, and 0.98 for thyroid cancer; 0.96, 0.94, and 0.95 for colon cancer; 0.96, 0.99, and 0.97 for lung cancer; and 0.95, 0.96, and 0.95 for generic topics.
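As a quick sanity check on the figures above, the reported F1 scores can be recomputed from the reported precision and recall, using the standard definition of F1 as their harmonic mean. The sketch below is illustrative only: the function and variable names are our own, and because the published precision and recall values are already rounded to two decimals, the recomputed F1 can only be expected to agree after rounding.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Per-class (precision, recall, reported F1) as stated in the abstract.
reported = {
    "thyroid cancer": (0.99, 0.97, 0.98),
    "colon cancer": (0.96, 0.94, 0.95),
    "lung cancer": (0.96, 0.99, 0.97),
    "generic topics": (0.95, 0.96, 0.95),
}


def recomputed_matches(scores: dict) -> dict:
    """Return, per class, whether F1 recomputed from P and R rounds to the reported F1."""
    result = {}
    for label, (p, r, f1_reported) in scores.items():
        f1 = round(f1_score(p, r), 2)
        result[label] = abs(f1 - f1_reported) < 1e-9
    return result


if __name__ == "__main__":
    for label, ok in recomputed_matches(reported).items():
        p, r, f1_reported = reported[label]
        print(f"{label}: recomputed F1 = {f1_score(p, r):.4f}, "
              f"reported = {f1_reported:.2f}, consistent = {ok}")
```

For all four classes, the recomputed value rounds to the reported F1, so the three metrics given for each class are mutually consistent.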