Gene set analysis is a mainstay of functional genomics, but it relies on manually curated databases of gene functions that are incomplete and unaware of biological context. Here we evaluate the ability of OpenAI's GPT-4, a large language model (LLM), to develop hypotheses about common gene functions from its embedded biomedical knowledge. We created a GPT-4 pipeline to label gene sets with names that summarize their consensus functions, substantiated by analysis text and citations. Benchmarking against named gene sets in the Gene Ontology, GPT-4 generated very similar names in 50% of cases, while in most remaining cases it recovered the name of a more general concept. In gene sets discovered in 'omics data, GPT-4 names were more informative than gene set enrichment, with supporting statements and citations that were largely verified in human review. The ability to rapidly synthesize common gene functions positions LLMs as valuable functional genomics assistants.