Evaluating large language models (LLMs) on their ability to generate high-quality, accurate, situationally aware answers to clinical questions requires going beyond conventional benchmarks to assess how these systems behave in complex, high-stakes clinical scenarios. Traditional evaluations are often limited to multiple-choice questions that fail to capture essential competencies such as contextual reasoning, situational awareness, and uncertainty handling. To address these limitations, we evaluate our agentic RAG-based clinical support assistant, DR. INFO, using HealthBench, a rubric-driven benchmark composed of open-ended, expert-annotated health conversations. On the Hard subset of 1,000 challenging examples, DR. INFO achieves a HealthBench Hard score of 0.68, outperforming leading frontier LLMs including the GPT-5 model family (GPT-5: 0.46, GPT-5.2: 0.42, GPT-5.1: 0.40), Grok 3 (0.23), Gemini 2.5 Pro (0.19), and Claude 3.7 Sonnet (0.02) across all behavioral axes (accuracy, completeness, instruction following, etc.). In a separate 100-sample evaluation against similar agentic RAG assistants (OpenEvidence and Pathway.md, now DoxGPT by Doximity), it maintains its performance lead with a HealthBench Hard score of 0.72. These results highlight the strengths of DR. INFO in communication, instruction following, and accuracy, while also revealing areas for improvement in context awareness and response completeness. Overall, the findings underscore the utility of behavior-level, rubric-based evaluation for building reliable and trustworthy AI-enabled clinical support systems.
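To make the rubric-driven scoring concrete, the following is a minimal sketch of how a HealthBench-style score can be computed. It assumes the general HealthBench scheme: each rubric criterion carries a point value (negative values penalize undesirable behaviors), a grader marks which criteria a response meets, the per-example score is earned points over the maximum achievable positive points clipped to [0, 1], and the benchmark score is the mean over examples. Function names and data shapes here are illustrative, not the authors' actual evaluation code.

```python
# Sketch of rubric-based scoring in the style of HealthBench (illustrative).
# Assumptions: criteria are (criterion_id, points) pairs, where negative
# points penalize undesirable behaviors; `met` is the set of criterion ids
# a grader judged the response to satisfy.

def score_example(criteria, met):
    """Per-example score: earned points / max positive points, clipped to [0, 1]."""
    earned = sum(pts for cid, pts in criteria if cid in met)
    max_points = sum(pts for _, pts in criteria if pts > 0)
    if max_points == 0:
        return 0.0
    return min(max(earned / max_points, 0.0), 1.0)

def benchmark_score(examples):
    """Benchmark score: mean per-example score over (criteria, met) pairs."""
    scores = [score_example(criteria, met) for criteria, met in examples]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this scheme a response that triggers only penalty criteria is clipped to 0, which is one way a model can score near zero on Hard examples despite producing fluent text.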