As Large Language Models (LLMs) become widely accessible, a detailed understanding of their knowledge within specific domains becomes necessary for successful real-world use. This is particularly critical in the domains of medicine and public health, where failure to retrieve relevant, accurate, and current information could significantly impact UK residents. However, while a number of LLM benchmarks exist in the medical domain, little is currently known about LLM knowledge within the field of public health. To address this gap, this paper introduces a new benchmark, PubHealthBench, with over 8,000 questions for evaluating LLMs' Multiple Choice Question Answering (MCQA) and free-form responses to public health queries. To create PubHealthBench we extract free text from 687 current UK government guidance documents and implement an automated pipeline for generating MCQA samples. Assessing 24 LLMs on PubHealthBench, we find the latest proprietary LLMs (GPT-4.5, GPT-4.1, and o1) have a high degree of knowledge, achieving >90% accuracy in the MCQA setup and outperforming humans with cursory search engine use. However, in the free-form setup we see lower performance, with no model scoring >75%. Therefore, while there are promising signs that state-of-the-art (SOTA) LLMs are an increasingly accurate source of public health information, additional safeguards or tools may still be needed when providing free-form responses.