LA4SR: illuminating the dark proteome with generative AI

AI language models (LMs) show promise for biological sequence analysis. We re-engineered open-source LMs (GPT-2, BLOOM, DistilRoBERTa, ELECTRA, and Mamba, ranging from 70M to 12B parameters) for microbial sequence classification. The models achieved F1 scores of up to 95 and ran 16,580x faster than BLASTP, with 2.9x its recall. They effectively classified the algal dark proteome (uncharacterized proteins comprising about 65% of total proteins) and were validated on new data, including a new, complete Hi-C/PacBio Chlamydomonas genome. Larger (>1B parameters) LA4SR models reached high accuracy (F1 > 86) when trained on less than 2% of the available data, rapidly achieving strong generalization capacity. High accuracy was achieved whether the training data had intact or scrambled terminal information, demonstrating robust generalization to incomplete sequences. Finally, we provide custom AI explainability software tools for attributing amino acid patterns to AI generative processes and for interpreting their outputs in evolutionary and biophysical contexts.
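For readers less familiar with the metrics quoted above, the following is a minimal sketch of how recall and F1 are conventionally computed for one class of a sequence classifier, reported on the 0-100 scale the abstract appears to use. The function name, the `algal`/`bacterial` labels, and the toy predictions are illustrative assumptions, not the authors' evaluation code or data.

```python
from collections import Counter


def precision_recall_f1(y_true, y_pred, positive):
    """Return (precision, recall, F1) for one class, each on a 0-100 scale.

    Hypothetical helper for illustration; not taken from the LA4SR codebase.
    """
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            counts["tp"] += 1  # correctly predicted positive
        elif pred == positive:
            counts["fp"] += 1  # predicted positive, actually negative
        elif truth == positive:
            counts["fn"] += 1  # missed positive
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return 100 * precision, 100 * recall, 100 * f1


# Toy labels (assumed for illustration only).
y_true = ["algal", "algal", "algal", "algal", "bacterial", "bacterial"]
y_pred = ["algal", "algal", "algal", "bacterial", "algal", "bacterial"]

precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive="algal")
print(f"precision={precision:.1f} recall={recall:.1f} F1={f1:.1f}")
```

Under this convention, a recall ratio such as the 2.9x figure compares how many true members of a class each method recovers, independent of how many false positives it produces.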

