Improving accuracy of GPT-3/4 results on biomedical data using a retrieval-augmented language model

Large language models (LLMs) have made significant advances in natural language processing (NLP). Broad corpora capture diverse patterns but can introduce irrelevant material, while focused corpora improve reliability by reducing misleading information. Training LLMs on focused corpora, however, is computationally demanding. An alternative is retrieval augmentation (RetA), tested here in a specific domain. To evaluate LLM performance, OpenAI's GPT-3, GPT-4, Bing's Prometheus, and a custom RetA model were compared on 19 questions about diffuse large B-cell lymphoma (DLBCL). Eight independent reviewers assessed the responses for accuracy, relevance, and readability, each rated on a scale of 1-3. The RetA model performed best in accuracy (12/19 3-point scores, total=47) and relevance (13/19, total=50), followed by GPT-4 (accuracy: 8/19, total=43; relevance: 11/19, total=49). GPT-4 received the highest readability scores (17/19, total=55), followed by GPT-3 (15/19, total=53) and the RetA model (11/19, total=47). Prometheus underperformed in accuracy (total=34), relevance (total=32), and readability (total=38). Across all 19 responses, both GPT-3.5 and GPT-4 produced more hallucinations than the RetA model and Prometheus. Hallucinations mostly involved non-existent references or fabricated efficacy data. These findings suggest that RetA models supplemented with domain-specific corpora may outperform general-purpose LLMs in accuracy and relevance within specific domains. However, this evaluation was limited to specific questions and metrics and may not capture challenges in semantic search and other NLP tasks. Further research will explore different LLM architectures, RetA methodologies, and evaluation methods to assess strengths and limitations more comprehensively.
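
The summary figures above pair a count of top-rated answers with a total score per metric. The following is a minimal sketch of that tally, assuming one aggregated rating (1-3) per question, so the maximum total over 19 questions is 57. How the eight reviewers' ratings were combined is not stated in the abstract. The names `MetricSummary` and `summarize` are hypothetical, and the per-question ratings are made up for illustration: they are chosen only to be consistent with the reported accuracy summary for the RetA model and are not the study's data.

```python
from dataclasses import dataclass

MAX_RATING = 3  # ratings run from 1 to 3, as stated in the abstract


@dataclass
class MetricSummary:
    """Summary of one metric (e.g. accuracy) for one model."""
    top_count: int    # number of questions that received the top rating
    n_questions: int  # number of questions rated
    total: int        # sum of all per-question ratings

    def __str__(self) -> str:
        return (f"{self.top_count}/{self.n_questions} "
                f"{MAX_RATING}-point scores, total={self.total}")


def summarize(ratings: list[int]) -> MetricSummary:
    """Tally top-rated answers and the total score for a list of ratings."""
    if any(r not in (1, 2, 3) for r in ratings):
        raise ValueError("each rating must be 1, 2, or 3")
    top_count = sum(1 for r in ratings if r == MAX_RATING)
    return MetricSummary(top_count, len(ratings), sum(ratings))


# Hypothetical per-question accuracy ratings (NOT the study's data):
# twelve 3s, four 2s, and three 1s over 19 questions.
hypothetical_accuracy = [3] * 12 + [2] * 4 + [1] * 3

summary = summarize(hypothetical_accuracy)
print(summary)  # 12/19 3-point scores, total=47
```

Reporting both numbers is useful because the total alone cannot distinguish a model with many top-rated and many poor answers from one that is uniformly mediocre.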