Motivation: Machine learning methods can support scientific discovery in healthcare-related research fields. However, these methods are reliable only when they are trained on high-quality, curated datasets. Currently, no such dataset exists for the exploration of Plasmodium falciparum protein antigen candidates. The parasite Plasmodium falciparum causes the infectious disease malaria, so identifying potential antigens is of utmost importance for the development of antimalarial drugs and vaccines. Because exploring antigen candidates experimentally is expensive and time-consuming, applying machine learning methods to support this process has the potential to accelerate the development of the drugs and vaccines needed to fight and control malaria.

Results: We developed PlasmoFAB, a curated benchmark that can be used to train machine learning methods for the exploration of Plasmodium falciparum protein antigen candidates. We combined an extensive literature search with domain expertise to create high-quality labels for Plasmodium falciparum-specific proteins that distinguish between antigen candidates and intracellular proteins. Additionally, we used our benchmark to compare several well-known prediction models and available protein localization prediction services on the task of identifying protein antigen candidates. We show that the available general-purpose services do not perform sufficiently well at identifying protein antigen candidates and are outperformed by our models, which were trained on this tailored data.

Availability: PlasmoFAB is publicly available on Zenodo under DOI 10.5281/zenodo.7433087. Furthermore, all scripts used to create PlasmoFAB and to train and evaluate the machine learning models are open source and publicly available on GitHub at https://github.com/msmdev/PlasmoFAB.