PlasmoFAB: A Benchmark to Foster Machine Learning for Plasmodium falciparum Protein Antigen Candidate Prediction

Motivation: Machine learning methods can support scientific discovery in healthcare-related research fields. However, they can only be used reliably if they are trained on high-quality, curated datasets. Currently, no such dataset exists for the exploration of Plasmodium falciparum protein antigen candidates. The parasite Plasmodium falciparum causes the infectious disease malaria, so identifying potential antigens is of utmost importance for the development of antimalarial drugs and vaccines. Since exploring antigen candidates experimentally is expensive and time-consuming, applying machine learning methods to support this process has the potential to accelerate the development of the drugs and vaccines needed for fighting and controlling malaria.

Results: We developed PlasmoFAB, a curated benchmark that can be used to train machine learning methods for the exploration of Plasmodium falciparum protein antigen candidates. We combined an extensive literature search with domain expertise to create high-quality labels for Plasmodium falciparum-specific proteins that distinguish between antigen candidates and intracellular proteins. Additionally, we used our benchmark to compare different well-known prediction models and available protein localization prediction services on the task of identifying protein antigen candidates. We show that available general-purpose services do not perform sufficiently well at identifying protein antigen candidates and are outperformed by our models, which were trained on this tailored data.

Availability: PlasmoFAB is publicly available on Zenodo with DOI 10.5281/zenodo.7433087. Furthermore, all scripts that were used in the creation of PlasmoFAB and in the training and evaluation of machine learning models are open source and publicly available on GitHub at https://github.com/msmdev/PlasmoFAB.
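The comparison described under Results amounts to scoring several predictors against the same binary labels (antigen candidate vs. intracellular protein). The abstract does not say which metrics were used, so the following is only a minimal sketch that assumes the Matthews correlation coefficient as one reasonable choice for a binary benchmark. The function names, the label encoding (1 = antigen candidate, 0 = intracellular), and the toy label lists are all hypothetical and are not taken from PlasmoFAB.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels.

    Assumed encoding (not from the paper): 1 = antigen candidate,
    0 = intracellular protein.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Made-up toy labels and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
predictor_a = [0, 0, 1, 0, 0, 0, 0, 0]
predictor_b = [1, 1, 1, 0, 0, 1, 0, 1]

for name, y_pred in [("predictor_a", predictor_a), ("predictor_b", predictor_b)]:
    print(name, confusion_counts(y_true, y_pred), round(mcc(y_true, y_pred), 3))
```

Scoring every predictor with the same function on the same held-out labels keeps the comparison fair; the actual evaluation protocol is defined by the scripts in the GitHub repository linked above.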

