Peptides are ubiquitous and important biologically derived molecules that have been found to self-assemble into a wide array of structures. Extensive research has explored how both internal chemical composition and external environmental stimuli affect the self-assembly behavior of these systems. However, no systematic study has yet gathered this rich literature data and examined these experimental factors collectively to provide a global picture of the fundamental rules that govern protein self-assembly behavior. In this work, we curate a peptide assembly database through a combination of manual processing by human experts and literature mining facilitated by a large language model. As a result, we collect more than 1,000 experimental data entries with information about peptide sequence, experimental conditions and corresponding self-assembly phases. Using the collected data, we train and evaluate machine learning (ML) models, which demonstrate high accuracy (>80%) and efficiency in classifying peptide assembly phases. Moreover, we fine-tune our GPT model for peptide literature mining with the developed dataset; the fine-tuned model markedly outperforms the pre-trained model in extracting information from academic publications. We find that this workflow can substantially improve efficiency when exploring potential self-assembling peptide candidates by guiding experimental work, while also deepening our understanding of the mechanisms governing peptide self-assembly. In doing so, novel structures can be accessed for a range of applications including sensing, catalysis and biomaterials.