There is a lack of benchmarks for evaluating large language models (LLMs) on long-form medical question answering (QA). Most existing medical QA benchmarks rely on automatic metrics and multiple-choice questions. While valuable, these benchmarks fail to fully capture the complexities of the real-world clinical settings where LLMs are being deployed. Furthermore, existing studies on evaluating long-form answer generation in medical QA are primarily closed-source, lacking access to human medical expert annotations, which makes it difficult to reproduce results and improve existing baselines. In this work, we introduce a new publicly available benchmark featuring real-world consumer medical questions with long-form answer evaluations annotated by medical doctors. We performed pairwise comparisons of responses from various open- and closed-source medical and general-purpose LLMs based on criteria such as correctness, helpfulness, harmfulness, and bias. Additionally, we performed a comprehensive LLM-as-a-judge analysis to study the alignment between human and LLM judgments. Our preliminary results highlight the strong potential of open LLMs in medical QA compared to leading closed-source models. Code & Data: https://github.com/lavita-ai/medical-eval-sphere