PETA: Evaluating the Impact of Protein Transfer Learning with Sub-word Tokenization on Downstream Applications

Large protein language models are adept at capturing the evolutionary information encoded in primary structures, which gives them significant practical value for protein engineering. Compared with natural language corpora, protein amino acid sequences offer a smaller data volume and a more limited combinatorial space, so choosing an appropriate vocabulary size for the pre-trained model is a pivotal issue. Moreover, despite the wealth of benchmarks and studies in the natural language community, a comprehensive benchmark for systematically evaluating protein language model quality is still lacking. To address these challenges, PETA trained language models with 14 different vocabulary sizes under three tokenization methods and ran thousands of tests on 33 diverse downstream datasets to assess the models' transfer learning capabilities, using two classification heads and three random seeds to mitigate potential biases. Extensive experiments indicate that vocabulary sizes between 50 and 200 optimize the model, whereas sizes exceeding 800 degrade the model's representational performance. Our code, model weights and datasets are available at https://github.com/ginnm/ProteinPretraining.
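To make the role of vocabulary size concrete, the following is a minimal sketch of sub-word tokenization on amino acid sequences. It assumes byte-pair encoding (BPE), a widely used sub-word method; the abstract does not name the three tokenization methods, so this is an illustration and not necessarily the procedure PETA used. The function names and the toy sequences are hypothetical.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(sequences, vocab_size):
    """Learn BPE merges until the vocabulary reaches `vocab_size` or no pair is left."""
    # The base vocabulary is the set of single amino-acid letters seen in the data.
    corpus = [list(seq) for seq in sequences]
    vocab = sorted({tok for seq in corpus for tok in seq})
    merges = []
    while len(vocab) < vocab_size:
        pairs = Counter()
        for seq in corpus:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break  # every sequence has collapsed into a single token
        # Most frequent adjacent pair; ties are broken lexicographically.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        merged = best[0] + best[1]
        if merged not in vocab:
            vocab.append(merged)
        corpus = [apply_merge(seq, best) for seq in corpus]
    return vocab, merges


def tokenize(sequence, merges):
    """Tokenize a sequence by replaying the learned merges in order."""
    tokens = list(sequence)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


# Hypothetical toy corpus; real pre-training data would be far larger.
toy_sequences = ["MKVLAAGIV", "MKVLSAGIV", "MKTLAAGIV"]
small_vocab, small_merges = train_bpe(toy_sequences, vocab_size=9)
large_vocab, large_merges = train_bpe(toy_sequences, vocab_size=12)
per_residue = tokenize("MKVLAAGIV", small_merges)
sub_word = tokenize("MKVLAAGIV", large_merges)
```

With the base vocabulary of nine residues, no merges are learned and the sequence stays at one token per residue. Raising the target size lets frequent fragments become single tokens, which shortens the tokenized sequence but spreads the training signal over more vocabulary entries.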

