This research evaluates four non-commercial, open-source large language models (LLMs), Meditron, MedAlpaca, Mistral, and Llama-2, on their ability to interpret medical guidelines stored in PDF format. As a test scenario, we applied these models to the European Society of Cardiology (ESC) guidelines for hypertension in children and adolescents. Using Streamlit, a Python library, we developed MedDoc-Bot, a user-friendly medical document chatbot that enables authorized users to upload PDF files and pose questions, generating interpretive responses from the four locally stored LLMs. To establish an evaluation benchmark, a pediatric expert formulated questions and reference answers extracted from the ESC guidelines, then rated each model-generated response for fidelity and relevance. We also computed METEOR and chrF scores to measure the similarity of model responses to the reference answers. Llama-2 and Mistral performed well in the metric evaluation, although Llama-2 was slower when handling text and tabular data. In the human evaluation, responses generated by Mistral, Meditron, and Llama-2 exhibited reasonable fidelity and relevance. This study provides insights into the strengths and limitations of LLMs for future developments in medical document interpretation. Open-Source Code: https://github.com/yaseen28/MedDoc-Bot
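To illustrate the kind of surface-similarity scoring used in the metric evaluation, the sketch below implements a simplified chrF-style character n-gram F-score in pure Python. This is an illustrative approximation only; the function name `chrf_score` and its parameter defaults are hypothetical, and a study would typically rely on a standard implementation (e.g., the sacrebleu library) rather than this sketch.

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring whitespace (a simplification)."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf_score(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: average character n-gram precision/recall, combined as F-beta.

    beta = 2.0 weights recall twice as heavily as precision, as in the
    standard chrF definition.
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # n-gram order longer than one of the strings
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

In this scheme, a model response identical to the expert's reference answer scores 1.0, while a response sharing no character n-grams with it scores 0.0; partial overlap yields an intermediate value.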