Large Language Models (LLMs) have recently gained significant traction in the medical domain, especially for developing question-answering (QA) systems that enhance access to healthcare in low-resource settings. This paper compares five LLMs released between April 2024 and August 2025 on medical QA, using the iCliniq dataset of 38,000 medical question-answer pairs spanning diverse specialties. The evaluated models are Llama-3-8B-Instruct, Llama 3.2 3B, Llama 3.3 70B Instruct, Llama-4-Maverick-17B-128E-Instruct, and GPT-5-mini. We adopt a zero-shot evaluation methodology, using BLEU and ROUGE metrics to measure performance without specialized fine-tuning. Our results show that larger models such as Llama 3.3 70B Instruct outperform smaller ones, consistent with scaling benefits observed in clinical tasks. Notably, Llama-4-Maverick-17B achieved competitive results, highlighting efficiency-performance trade-offs relevant to practical deployment. These findings align with advances in LLM capabilities toward professional-level medical reasoning and reflect the increasing feasibility of LLM-supported QA systems in real clinical environments. This benchmark aims to serve as a standardized setting for future studies that seek to minimize model size and computational cost while maximizing clinical utility in medical NLP applications.
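The n-gram overlap metrics mentioned above can be illustrated with a minimal sketch. The functions below implement simplified unigram variants of BLEU (clipped precision) and ROUGE-1 (F1); the actual paper presumably used standard library implementations, and the QA pair shown is a hypothetical example, not drawn from the iCliniq dataset.

```python
from collections import Counter

def bleu1(reference: str, candidate: str) -> float:
    """Unigram BLEU: clipped precision of candidate tokens against the reference."""
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    clipped = sum((ref & cand).values())  # each token counted at most as often as in reference
    return clipped / max(sum(cand.values()), 1)

def rouge1_f(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical reference answer and model output (illustrative only).
gold = "take paracetamol every six hours and rest"
pred = "take paracetamol every six hours with food"
print(round(bleu1(gold, pred), 3))    # 5 of 7 candidate tokens overlap -> 0.714
print(round(rouge1_f(gold, pred), 3))
```

In practice, full BLEU also aggregates higher-order n-grams with a brevity penalty, and ROUGE is typically reported as ROUGE-1/2/L; this sketch conveys only the core overlap computation.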