ProtFIM: Fill-in-Middle Protein Sequence Design via Protein Language Models

Sequence design · Residues · Protein engineering · Protein sequence design · Language models

March 29, 2023

Youhan Lee, Hasun Yu

arXiv preprint

Protein language models (pLMs), pre-trained via causal language modeling on protein sequences, have become a promising tool for protein sequence design. In real-world protein engineering, it is common to optimize the amino acids in the middle of a protein sequence while keeping the other residues fixed. Unfortunately, because pLMs generate left to right, existing pLMs can only modify suffix residues when prompted with prefix residues, which is insufficient for the infilling task, where the whole surrounding context must be considered. To find more effective pLMs for protein engineering, we design a new benchmark, Secondary structureE InFilling rEcoveRy (SEIFER), which approximates infilling sequence design scenarios. By evaluating existing models on the benchmark, we reveal the weakness of existing language models and show that language models trained via the fill-in-middle transformation, called ProtFIM, are more appropriate for protein engineering. We also show, through exhaustive experiments and visualizations, that ProtFIM generates protein sequences with decent protein representations.
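
The fill-in-middle transformation mentioned in the abstract can be illustrated with a minimal Python sketch. It reorders a sequence so that a left-to-right model sees both the prefix and the suffix before it generates the middle span. The sentinel strings `<PRE>`, `<SUF>`, and `<MID>`, the function names, and the random span selection are all assumptions made for illustration; the abstract does not specify ProtFIM's actual tokens or sampling scheme. The `sequence_recovery` helper likewise assumes that "recovery" means the fraction of infilled positions that match the native residues, which may differ from how SEIFER scores models.

```python
import random

# Hypothetical sentinel tokens: the abstract does not name ProtFIM's special tokens.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def fim_transform(sequence, start, end):
    """Reorder a sequence as prefix, suffix, then middle.

    sequence[start:end] is the middle span. Placing it last lets a causal
    model condition on both sides of the gap while still decoding left to right.
    """
    if not 0 <= start <= end <= len(sequence):
        raise ValueError("span must satisfy 0 <= start <= end <= len(sequence)")
    prefix, middle, suffix = sequence[:start], sequence[start:end], sequence[end:]
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"


def random_fim_transform(sequence, rng):
    """Apply fim_transform to a uniformly sampled span (an assumed sampling scheme)."""
    start, end = sorted(rng.randint(0, len(sequence)) for _ in range(2))
    return fim_transform(sequence, start, end)


def infilling_prompt(prefix, suffix):
    """Build an inference-time prompt; the model would continue after MID."""
    return f"{PRE}{prefix}{SUF}{suffix}{MID}"


def sequence_recovery(designed, native):
    """Fraction of positions where the designed span matches the native span."""
    if len(designed) != len(native) or not native:
        raise ValueError("spans must be non-empty and of equal length")
    return sum(a == b for a, b in zip(designed, native)) / len(native)


if __name__ == "__main__":
    seq = "MKTAYIAK"  # toy sequence, not taken from the paper
    print(fim_transform(seq, 2, 5))
    print(infilling_prompt("MK", "IAK"))
    print(random_fim_transform(seq, random.Random(0)))
    print(sequence_recovery("TAF", "TAY"))
```

With this layout, a model prompted with only the prefix never sees the residues after the gap, whereas the reordered prompt exposes both flanks before any middle residue is generated.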
