Objective: Large language models (LLMs) are increasingly applied in biomedical settings, and existing benchmark datasets have played an important role in supporting model development and evaluation. However, these benchmarks often have limitations. Many rely on static or outdated datasets that fail to capture the dynamic, context-rich, and high-stakes nature of biomedical knowledge. They also carry a growing risk of data leakage due to overlap with model pretraining corpora, and they often overlook critical dimensions such as robustness to linguistic variation and potential demographic biases. Materials and Methods: To address these gaps, we introduce BioPulse-QA, a benchmark that evaluates LLMs on answering questions from newly published biomedical documents, including drug labels, trial protocols, and clinical guidelines. BioPulse-QA comprises 2,280 expert-verified question answering (QA) pairs and perturbed variants, covering both extractive and abstractive formats. We evaluate four LLMs (GPT-4o, GPT-o1, Gemini-2.0-Flash, and LLaMA-3.1 8B Instruct), all released prior to the publication dates of the benchmark documents. Results: On drug labels, GPT-o1 achieves the highest relaxed F1 score (0.92), followed by Gemini-2.0-Flash (0.90). Clinical trials are the most challenging source, with extractive F1 scores as low as 0.36. Discussion and Conclusion: Performance differences are larger under paraphrasing perturbations than under typographical errors, while bias testing shows negligible differences. BioPulse-QA provides a scalable and clinically relevant framework for evaluating biomedical LLMs.
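The abstract does not define the "relaxed" F1 metric. A common relaxed alternative to exact-match scoring in extractive QA is token-overlap F1 (SQuAD-style), which grants partial credit when a predicted answer span overlaps the gold answer. The sketch below is an illustrative assumption about how such a metric might be computed, not the paper's actual definition:

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer string.

    Illustrative 'relaxed' scoring: partial credit for shared tokens,
    rather than requiring an exact string match.
    """
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Both empty counts as a match; one empty counts as a miss.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection: each shared token counted at most
    # as often as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, a prediction of "twice daily" against a gold answer of "once daily" shares one of two tokens on each side, yielding precision = recall = 0.5 and F1 = 0.5, whereas exact match would score it 0.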