Large Language Models (LLMs) have shown promise in medical question answering, achieving passing scores on standardised exams, and have been suggested as tools for supporting healthcare workers. Deploying LLMs in such a high-risk context requires a clear understanding of their limitations. Given the rapid development and release of new LLMs, it is especially valuable to identify patterns that hold across models and may therefore persist in newer versions. In this paper, we evaluate a wide range of popular LLMs on medical question answering to better understand their properties as a group. From this comparison, we provide preliminary observations and raise open questions for further research.