Large protein language models are adept at capturing the evolutionary information encoded in primary structures, which makes them valuable in practice for protein engineering. Compared to natural language corpora, protein amino acid sequences offer a smaller data volume and a more limited combinatorial space, so choosing an appropriate vocabulary size for the pre-trained model is a pivotal issue. Moreover, despite the wealth of benchmarks and studies in the natural language community, a comprehensive benchmark for systematically evaluating protein language model quality is still lacking. To address these challenges, PETA trained language models with 14 different vocabulary sizes under three tokenization methods and ran thousands of tests on 33 diverse downstream datasets to assess the models' transfer learning capabilities, using two classification heads and three random seeds to mitigate potential biases. Extensive experiments indicate that vocabulary sizes between 50 and 200 optimize the model, whereas sizes exceeding 800 degrade its representational performance. Our code, model weights, and datasets are available at https://github.com/ginnm/ProteinPretraining.
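To make the notion of vocabulary size concrete, the following is a minimal sketch of how a subword vocabulary can grow beyond the 20 standard amino acids. It assumes byte-pair encoding (BPE) as an illustrative tokenization method; the abstract does not name the three methods PETA used, and this is not the authors' implementation. The names `train_bpe`, `apply_merge`, and `tokenize`, as well as the toy sequences, are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (base vocabulary)


def apply_merge(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(sequences, vocab_size):
    """Learn BPE merges until the vocabulary reaches `vocab_size` or no pairs remain."""
    corpus = [list(seq) for seq in sequences]
    vocab = sorted(set(AMINO_ACIDS) | {tok for toks in corpus for tok in toks})
    merges = []
    while len(vocab) < vocab_size:
        pairs = Counter()
        for toks in corpus:
            pairs.update(zip(toks, toks[1:]))
        if not pairs:
            break  # every sequence has collapsed to a single token
        # Most frequent pair; ties are broken lexicographically for determinism.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        merged = best[0] + best[1]
        if merged not in vocab:  # different merge paths can yield the same string
            vocab.append(merged)
        corpus = [apply_merge(toks, best) for toks in corpus]
    return vocab, merges


def tokenize(sequence, merges):
    """Tokenize a sequence by replaying the learned merges in order."""
    tokens = list(sequence)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


# Toy corpus (invented for illustration, not real proteins).
toy_sequences = ["MKVLMKVL", "MKVLAA"]
vocab, merges = train_bpe(toy_sequences, vocab_size=23)
print(len(vocab))                  # 23
print(tokenize("MKVLAA", merges))  # ['MKVL', 'A', 'A']
```

In this sketch a larger `vocab_size` yields longer tokens and shorter tokenized sequences, while each added token is seen less often during training; this is the kind of trade-off that a sweep over vocabulary sizes probes.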