This research evaluates four non-commercial, open-source large language models (LLMs), Meditron, MedAlpaca, Mistral, and Llama-2, on their ability to interpret medical guidelines stored in PDF format. As a test scenario, we applied these models to the European Society of Cardiology (ESC) guidelines for hypertension in children and adolescents. Using Streamlit, a Python library, we developed MedDoc-Bot, a user-friendly medical document chatbot that enables authorized users to upload PDF files and pose questions, generating interpretive responses from the four locally stored LLMs. To establish an evaluation benchmark, a pediatric expert formulated questions and reference answers extracted from the ESC guidelines, then rated each model-generated response for fidelity and relevance. We also computed METEOR and chrF scores to measure the similarity of model responses to the reference answers. Llama-2 and Mistral performed well in the metric evaluation, although Llama-2 was slower when handling text and tabular data. In the human evaluation, responses generated by Mistral, Meditron, and Llama-2 exhibited reasonable fidelity and relevance. This study provides insights into the strengths and limitations of LLMs for future developments in medical document interpretation. Open-Source Code: https://github.com/yaseen28/MedDoc-Bot
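To illustrate the kind of surface-similarity scoring used in the metric evaluation, the sketch below implements a simplified chrF-style character n-gram F-score in pure Python. This is an illustrative approximation only; the function name `chrf_score` and its parameter defaults are hypothetical, and a study would typically rely on a standard implementation (e.g., the sacrebleu library) rather than this sketch.

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring whitespace (a simplification)."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf_score(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: average character n-gram precision/recall, combined as F-beta.

    beta = 2.0 weights recall twice as heavily as precision, as in the
    standard chrF definition.
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # n-gram order longer than one of the strings
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

In this scheme, a model response identical to the expert's reference answer scores 1.0, while a response sharing no character n-grams with it scores 0.0; partial overlap yields an intermediate value.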