While large language models (LLMs) have been successfully applied to various tasks, they still hallucinate and generate erroneous content. Augmenting LLMs with domain-specific tools such as database utilities can give them more precise and direct access to specialized knowledge. In this paper, we present GeneGPT, a novel method for teaching LLMs to use the Web Application Programming Interfaces (APIs) of the National Center for Biotechnology Information (NCBI) to answer genomics questions. Specifically, we prompt Codex (code-davinci-002) to solve the GeneTuring tests, using a few URL requests for NCBI API calls as demonstrations for in-context learning. During inference, we stop decoding as soon as a call request is detected and make the API call with the generated URL. We then append the raw execution results returned by the NCBI APIs to the generated text and continue generation until the answer is found or another API call is detected. Our preliminary results show that GeneGPT achieves state-of-the-art results on three out of four one-shot tasks and four out of five zero-shot tasks in the GeneTuring dataset. Overall, GeneGPT achieves a macro-average score of 0.76, which is much higher than the scores of retrieval-augmented LLMs such as the New Bing (0.44), biomedical LLMs such as BioMedLM (0.08) and BioGPT (0.04), and other LLMs such as GPT-3 (0.16) and ChatGPT (0.12).
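The inference procedure described in the abstract (stop decoding at a call request, execute the call, append the raw result, resume generation) can be illustrated with a minimal sketch. This is not the authors' implementation: the call format `[URL]->`, the function name `answer_with_api_calls`, the `max_calls` budget, and the `example.org` URL are all assumptions made for illustration, and the model and the HTTP client are passed in as plain callables so the loop can be run without network access.

```python
import re
from typing import Callable, List

# Assumption: a call request appears in the generated text as a URL in
# square brackets followed by "->". The real prompt format may differ.
CALL_PATTERN = re.compile(r"\[(https?://[^\]\s]+)\]->")


def answer_with_api_calls(
    prompt: str,
    generate: Callable[[str], str],  # hypothetical: returns the model's continuation
    fetch: Callable[[str], str],     # hypothetical: returns the raw API response body
    max_calls: int = 5,              # assumed safety budget, not from the paper
) -> str:
    """Interleave generation with API calls until no further call is requested."""
    text = prompt
    for _ in range(max_calls):
        continuation = generate(text)
        match = CALL_PATTERN.search(continuation)
        if match is None:
            # No call request detected: treat the continuation as the final answer.
            return text + continuation
        # "Stop decoding" at the call: discard anything generated after it.
        text += continuation[: match.end()]
        # Execute the call and append the raw result before resuming generation.
        text += "[" + fetch(match.group(1)) + "]\n"
    # Call budget exhausted: return the transcript accumulated so far.
    return text


# Usage with stand-ins for the model and the API (all values are placeholders).
fetched: List[str] = []


def fake_generate(text: str) -> str:
    if "]->[" not in text:  # no API result in the context yet
        return "[https://example.org/lookup?term=GENE_X]->"
    return "Answer: SYMBOL_X"


def fake_fetch(url: str) -> str:
    fetched.append(url)
    return "canned response for GENE_X"


result = answer_with_api_calls(
    "Question: What is the official symbol of GENE_X?\n", fake_generate, fake_fetch
)
print(result)
```

Passing `generate` and `fetch` as arguments keeps the control flow separate from any particular model endpoint or HTTP library; in a real system the truncation step would typically be handled by a stop sequence at decoding time rather than by post-hoc slicing.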