Improving accuracy of GPT-3/4 results on biomedical data using a retrieval-augmented language model

Large language models (LLMs) have driven significant advances in natural language processing (NLP). Broad corpora capture diverse patterns but can introduce irrelevant information, while focused corpora enhance reliability by reducing misleading information. Training LLMs on focused corpora, however, poses computational challenges. An alternative approach is to use a retrieval-augmentation (RetA) method tested in a specific domain. To evaluate LLM performance, OpenAI's GPT-3, GPT-4, Bing's Prometheus, and a custom RetA model were compared using 19 questions on diffuse large B-cell lymphoma (DLBCL). Eight independent reviewers assessed the responses for accuracy, relevance, and readability, each rated on a scale of 1 to 3. The RetA model performed best in accuracy (12/19 responses with a 3-point score, total score 47) and relevance (13/19, 50), followed by GPT-4 (accuracy 8/19, 43; relevance 11/19, 49). GPT-4 received the highest readability scores (17/19, 55), followed by GPT-3 (15/19, 53) and the RetA model (11/19, 47). Prometheus underperformed in accuracy (34), relevance (32), and readability (38). Across the 19 responses, both GPT-3.5 and GPT-4 produced more hallucinations than the RetA model and Prometheus. Hallucinations mostly involved non-existent references or fabricated efficacy data. These findings suggest that RetA models, supplemented with domain-specific corpora, may outperform general-purpose LLMs in accuracy and relevance within specific domains. However, this evaluation was limited to specific questions and metrics and may not capture challenges in semantic search and other NLP tasks. Further research will explore different LLM architectures, RetA methodologies, and evaluation methods to assess strengths and limitations more comprehensively.
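To make the general idea of retrieval augmentation concrete, the following is a minimal sketch, not the custom RetA model evaluated here, whose implementation the abstract does not describe. It assumes a simple bag-of-words cosine similarity for retrieval and a plain-text prompt template; the function names (`tokenize`, `cosine`, `retrieve`, `build_prompt`), the corpus passages, and the prompt wording are all hypothetical placeholders, and the call to an actual LLM is omitted.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k passages most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        corpus,
        key=lambda passage: cosine(q, Counter(tokenize(passage))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, passages: list[str]) -> str:
    """Prepend the retrieved passages so the model answers from them."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Placeholder corpus (hypothetical; a real system would index domain literature).
corpus = [
    "DLBCL is an abbreviation for diffuse large B-cell lymphoma.",
    "Retrieval augmentation passes retrieved passages to a language model at query time.",
    "Placeholder passage about an unrelated topic such as afternoon weather.",
]
question = "What does DLBCL stand for?"
top_passages = retrieve(question, corpus, k=1)
prompt = build_prompt(question, top_passages)
print(prompt)
```

The prompt produced this way would then be sent to a general-purpose LLM. Restricting the model to retrieved, domain-specific text is one plausible reason such a setup could reduce fabricated references, although the sketch itself makes no claim about the accuracy figures reported above.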
