Large language models (LLMs) have shown remarkable performance on many tasks across different domains. However, their performance on closed-book biomedical machine reading comprehension (MRC) has not been evaluated in depth. In this work, we evaluate GPT on four closed-book biomedical MRC benchmarks. We experiment with several conventional prompting techniques and also introduce a novel prompting method of our own. To address some of the retrieval problems inherent to LLMs, we propose a prompting strategy named Implicit Retrieval Augmented Generation (RAG) that removes the need for the vector databases used to retrieve relevant chunks in traditional RAG setups. Moreover, we report qualitative assessments of the natural language generation outputs produced by our approach. The results show that our new prompting technique achieves the best performance on two of the four datasets and ranks second on the remaining two. Experiments show that modern LLMs like GPT, even in a zero-shot setting, can outperform supervised models, leading to new state-of-the-art (SoTA) results on two of the benchmarks.
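To illustrate the Implicit RAG idea mentioned above, the sketch below builds a single prompt that asks the model to first quote the relevant passage from the provided context and only then answer, folding the retrieval step into generation instead of querying a vector database. This is a minimal illustrative sketch under stated assumptions: the function name, section labels, and prompt wording are hypothetical, not the authors' exact prompt.

```python
# Hypothetical sketch of an "Implicit RAG" prompt: the model is instructed to
# (1) quote the most relevant sentences from the given context, then
# (2) answer using only those quoted sentences. No vector database is involved.
# All names and wording here are illustrative assumptions.

def build_implicit_rag_prompt(context: str, question: str) -> str:
    """Compose one prompt that folds chunk retrieval into the generation step."""
    return (
        "You are answering a biomedical reading-comprehension question.\n"
        "Step 1 (Retrieve): quote the sentence(s) from the CONTEXT below that "
        "are most relevant to the QUESTION.\n"
        "Step 2 (Answer): using only the quoted sentences, give a concise "
        "answer.\n\n"
        f"CONTEXT:\n{context}\n\n"
        f"QUESTION:\n{question}\n\n"
        "Relevant sentences:"
    )

# Example usage with a toy biomedical context (made up for illustration):
prompt = build_implicit_rag_prompt(
    context=(
        "Metformin lowers hepatic glucose production. It is a first-line "
        "therapy for type 2 diabetes."
    ),
    question="What is a first-line drug for type 2 diabetes?",
)
print(prompt)
```

The assembled string would then be sent to the LLM as-is; because the instruction to retrieve precedes the instruction to answer, the model surfaces its own supporting evidence before committing to an answer.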