Understanding protein interactions and pathways is crucial for unraveling the complexities of living systems and for investigating the mechanisms underlying biological functions and complex diseases. Existing databases provide biological data curated from the literature and other sources, but they are often incomplete and labor-intensive to maintain, which motivates alternative approaches. In this study, we propose harnessing large language models to address these issues by automatically extracting such knowledge from the relevant scientific literature. Toward this goal, we investigate the effectiveness of different large language models on tasks that involve recognizing protein interactions, pathways, and gene regulatory relations. We thoroughly evaluate the performance of various models, highlight the significant findings, and discuss both the future opportunities and the remaining challenges associated with this approach. The code and data are available at: https://github.com/boxorange/BioIE-LLM