Identifying immune checkpoint inhibitor (ICI) studies in genomic repositories such as the Gene Expression Omnibus (GEO) is vital for cancer research, yet remains challenging due to semantic ambiguity, extreme class imbalance, and limited labeled data in low-resource settings. We present ProtoBERT-LoRA, a hybrid framework that combines PubMedBERT with prototypical networks and Low-Rank Adaptation (LoRA) for efficient fine-tuning. The model enforces class-separable embeddings via episodic prototype training while preserving biomedical domain knowledge. Our dataset was divided as follows: training (20 positive, 20 negative), prototype set (10 positive, 10 negative), validation (20 positive, 200 negative), and test (71 positive, 765 negative). Evaluated on the test set, ProtoBERT-LoRA achieved an F1-score of 0.624 (precision: 0.481, recall: 0.887), outperforming the rule-based system, machine learning baselines, and fine-tuned PubMedBERT. Applied to 44,287 unlabeled studies, it reduced manual review effort by 82%. Ablation studies confirmed that combining prototypes with LoRA improved performance by 29% over standalone LoRA.
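The sketch below illustrates, under stated assumptions, how a LoRA-adapted PubMedBERT encoder can be paired with prototypical-network episodic training of the kind the abstract describes; it is not the authors' released code. The checkpoint identifier, LoRA ranks, target modules, and mean-pooling choice are illustrative assumptions, and the example uses the `transformers` and `peft` libraries.

```python
# Minimal sketch (not the authors' implementation): LoRA-adapted PubMedBERT
# encoder with prototypical-network episodic loss for binary ICI classification.
import torch
from transformers import AutoTokenizer, AutoModel
from peft import LoraConfig, get_peft_model

MODEL = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"  # assumed checkpoint ID
tokenizer = AutoTokenizer.from_pretrained(MODEL)
encoder = get_peft_model(
    AutoModel.from_pretrained(MODEL),
    LoraConfig(r=8, lora_alpha=16, target_modules=["query", "value"]),  # illustrative LoRA settings
)

def embed(texts):
    """Masked mean-pooled token embeddings for a list of study descriptions."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state        # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)       # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)        # (B, H)

def prototypical_loss(support_emb, support_y, query_emb, query_y):
    """Episodic loss: prototypes are class-mean embeddings of the support set;
    queries are scored by negative squared Euclidean distance to each prototype."""
    prototypes = torch.stack([support_emb[support_y == c].mean(0) for c in (0, 1)])
    logits = -torch.cdist(query_emb, prototypes) ** 2  # (Q, 2)
    return torch.nn.functional.cross_entropy(logits, query_y)
```

In this formulation, only the LoRA adapter weights are updated during episodic training, while distance-to-prototype scoring supplies the class-separability pressure described in the abstract.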