AI language models (LMs) show promise for biological sequence analysis. We re-engineered open-source LMs (GPT-2, BLOOM, DistilRoBERTa, ELECTRA, and Mamba, ranging from 70M to 12B parameters) for microbial sequence classification, yielding the LA4SR models. The models achieved F1 scores of up to 95 and ran 16,580x faster than BLASTP with 2.9x its recall. They effectively classified the algal dark proteome (uncharacterized proteins that make up about 65% of total proteins), a result validated on new data, including a new, complete Hi-C/PacBio Chlamydomonas genome. Larger (>1B parameters) LA4SR models reached high accuracy (F1 > 86) when trained on less than 2% of the available data, rapidly achieving strong generalization capacity. High accuracy was achieved whether the training data had intact or scrambled terminal information, demonstrating robust generalization to incomplete sequences. Finally, we provide custom AI explainability software tools for attributing amino acid patterns to AI generative processes, and we interpret their outputs in evolutionary and biophysical contexts.