Fine-tuning Large Language Model (LLM) Artificial Intelligence Chatbots in Ophthalmology and LLM-based evaluation using GPT-4

Purpose: To assess how well GPT-4-based evaluation aligns with human clinician experts when evaluating responses to ophthalmology-related patient queries generated by fine-tuned LLM chatbots. Methods: 400 ophthalmology questions and paired answers were created by ophthalmologists to represent commonly asked patient questions, divided into fine-tuning (368; 92%) and testing (40; 8%). We fine-tuned 5 different LLMs, including GPT-3.5, LLAMA2-7b, LLAMA2-7b-Chat, LLAMA2-13b, and LLAMA2-13b-Chat. For the testing dataset, an additional 8 glaucoma Q&A pairs were included. 200 responses to the testing dataset were generated by the 5 fine-tuned LLMs for evaluation. A customized clinical evaluation rubric, grounded in clinical accuracy, relevance, patient safety, and ease of understanding, was used to guide the GPT-4 evaluation. The GPT-4 evaluation was then compared against rankings by 5 clinicians to measure clinical alignment. Results: Among all fine-tuned LLMs, GPT-3.5 scored the highest (87.1%), followed by LLAMA2-13b (80.9%), LLAMA2-13b-Chat (75.5%), LLAMA2-7b-Chat (70%), and LLAMA2-7b (68.8%), based on the GPT-4 evaluation. The GPT-4 evaluation demonstrated significant agreement with human clinician rankings, with Spearman and Kendall's tau correlation coefficients of 0.90 and 0.80 respectively, while agreement based on Cohen's kappa was more modest at 0.50. Notably, qualitative analysis and the glaucoma sub-analysis revealed clinical inaccuracies in the LLM-generated responses, which were appropriately identified by the GPT-4 evaluation. Conclusion: The notable clinical alignment of the GPT-4 evaluation highlights its potential to streamline the clinical evaluation of LLM chatbot responses to healthcare-related queries. By complementing the existing clinician-dependent manual grading, this efficient and automated evaluation could assist the validation of future developments in LLM applications for healthcare.
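To make the three agreement statistics in the Results concrete, the sketch below computes Spearman's rho, Kendall's tau, and Cohen's kappa for two rankings of the five models. The GPT-4 scores are the ones reported above. The clinician ranking is hypothetical: the abstract does not give it, so the example assumes a ranking that differs from the GPT-4 ordering by a single adjacent swap. Treating each rank position as a category for Cohen's kappa is also an assumption, since the abstract does not say how kappa was computed.

```python
from itertools import combinations

# GPT-4 evaluation scores reported in the abstract (percent).
gpt4_scores = {
    "GPT-3.5": 87.1,
    "LLAMA2-13b": 80.9,
    "LLAMA2-13b-Chat": 75.5,
    "LLAMA2-7b-Chat": 70.0,
    "LLAMA2-7b": 68.8,
}

# HYPOTHETICAL clinician ranking (1 = best). Not from the source:
# it assumes the last two models are swapped relative to GPT-4.
clinician_rank = {
    "GPT-3.5": 1,
    "LLAMA2-13b": 2,
    "LLAMA2-13b-Chat": 3,
    "LLAMA2-7b-Chat": 5,
    "LLAMA2-7b": 4,
}


def ranks_from_scores(scores):
    """Convert scores to ranks, 1 = highest score (assumes no ties)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def spearman_rho(x, y):
    """Spearman's rho for two untied rank lists."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def cohen_kappa(x, y):
    """Cohen's kappa, treating each rank position as a category."""
    n = len(x)
    observed = sum(a == b for a, b in zip(x, y)) / n
    labels = set(x) | set(y)
    expected = sum((x.count(l) / n) * (y.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)


models = list(gpt4_scores)
gpt4_rank = ranks_from_scores(gpt4_scores)
x = [gpt4_rank[m] for m in models]
y = [clinician_rank[m] for m in models]

rho = spearman_rho(x, y)
tau = kendall_tau(x, y)
kappa = cohen_kappa(x, y)
print(f"Spearman rho = {rho:.2f}, Kendall tau = {tau:.2f}, Cohen kappa = {kappa:.2f}")
```

Under these assumptions the three statistics come out as 0.90, 0.80, and 0.50. That matches the reported values, but it only shows that a one-swap disagreement among five ranked items is consistent with them, not that this is how the authors obtained their figures. It also illustrates why kappa can look modest while the rank correlations are high: kappa only credits exact matches in rank position, whereas rho and tau give credit for near-agreement in ordering.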

