While large language models (LLMs) have been successfully applied to various tasks, they remain prone to hallucination. Augmenting LLMs with domain-specific tools such as database utilities can give them easier and more precise access to specialized knowledge. In this paper, we present GeneGPT, a novel method for teaching LLMs to use the Web APIs of the National Center for Biotechnology Information (NCBI) to answer genomics questions. Specifically, we prompt Codex to solve the GeneTuring tests with NCBI Web APIs through in-context learning and an augmented decoding algorithm that can detect and execute API calls. Experimental results show that GeneGPT achieves state-of-the-art performance on eight tasks in the GeneTuring benchmark with an average score of 0.83, substantially outperforming retrieval-augmented LLMs such as the new Bing (0.44), biomedical LLMs such as BioMedLM (0.08) and BioGPT (0.04), as well as GPT-3 (0.16) and ChatGPT (0.12). Our further analyses suggest that: (1) API demonstrations generalize well across tasks and are more useful than documentation for in-context learning; (2) GeneGPT can generalize to longer chains of API calls and answer multi-hop questions in GeneHop, a novel dataset introduced in this work; (3) different types of errors are enriched in different tasks, providing valuable insights for future improvements.
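To make the idea of an augmented decoding loop that detects and executes API calls concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes, purely for illustration, that the model writes a call as `[URL]->` and that the result is appended as `[result]`. The names `augmented_decode`, `toy_generate`, and `toy_fetch`, the `example.org` URL, and the placeholder values are all hypothetical; a real system would replace the two toy functions with an LLM completion call and an HTTP request.

```python
import re
from typing import Callable

# Assumed call syntax (hypothetical): the model emits "[URL]->" and then stops.
CALL_PATTERN = re.compile(r"\[(https?://[^\]\s]+)\]->$")


def augmented_decode(
    prompt: str,
    generate: Callable[[str], str],
    fetch: Callable[[str], str],
    max_calls: int = 5,
) -> str:
    """Alternate between text generation and API execution."""
    text = prompt
    for _ in range(max_calls):
        text += generate(text)
        match = CALL_PATTERN.search(text)
        if match is None:
            # No pending API call at the end of the text: decoding is done.
            return text
        # Execute the detected call and splice its result into the context.
        text += "[" + fetch(match.group(1)) + "]\n"
    return text


def toy_generate(text: str) -> str:
    """Scripted stand-in for an LLM: one API call, then an answer."""
    if "->[" not in text:
        return "[https://example.org/api?term=ALIAS_X]->"
    return "Answer: SYMBOL_X"


def toy_fetch(url: str) -> str:
    """Stand-in for an HTTP request; returns a placeholder result."""
    return "SYMBOL_X"


result = augmented_decode("Q: toy question\n", toy_generate, toy_fetch)
print(result)
```

The loop stops as soon as a generation step ends without a pending call, and `max_calls` bounds the chain length so that a model that keeps issuing calls cannot loop forever.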