Learning the rules of peptide self-assembly through data mining with large language models

Peptides are ubiquitous and important biologically derived molecules that have been found to self-assemble into a wide array of structures. Extensive research has explored the impact of both internal chemical composition and external environmental stimuli on the self-assembly behavior of these systems. However, no systematic study has yet gathered this rich literature data and examined these experimental factors collectively to provide a global picture of the fundamental rules that govern protein self-assembly behavior. In this work, we curate a peptide assembly database through a combination of manual processing by human experts and literature mining facilitated by a large language model. As a result, we collect more than 1,000 experimental data entries with information about peptide sequence, experimental conditions and corresponding self-assembly phases. Using the collected data, we train and evaluate machine learning (ML) models, which demonstrate excellent accuracy (>80%) and efficiency in peptide assembly phase classification. Moreover, we fine-tune our GPT model for peptide literature mining with the developed dataset; the fine-tuned model markedly outperforms the pre-trained model in extracting information from academic publications. We find that this workflow can substantially improve efficiency when exploring potential self-assembling peptide candidates by guiding experimental work, while also deepening our understanding of the mechanisms governing peptide self-assembly. In doing so, novel structures can be accessed for a range of applications including sensing, catalysis and biomaterials.

