Understanding protein interactions and pathway knowledge is crucial for unraveling the complexities of living systems and for investigating the mechanisms underlying biological functions and complex diseases. Existing databases provide biological data curated from the literature and other sources, but they are often incomplete and labor-intensive to maintain, which motivates alternative approaches. In this study, we propose to harness large language models to address these issues by automatically extracting such knowledge from the relevant scientific literature. Toward this goal, we investigate the effectiveness of different large language models on three tasks: recognizing protein interactions, identifying genes associated with pathways affected by low-dose radiation, and recognizing gene regulatory relations. We thoroughly evaluate the performance of various models, highlight the significant findings, and discuss both the future opportunities and the remaining challenges associated with this approach. The code and data are available at: https://github.com/boxorange/BioIE-LLM