Large language models (LLMs) exhibit strong medical knowledge and can generate factually accurate responses. However, existing models often fail to account for individual patient contexts, producing answers that are clinically correct yet poorly aligned with patients' needs. In this work, we introduce DeCode, a training-free, model-agnostic framework that adapts existing LLMs to produce contextualized answers in clinical settings. We evaluate DeCode on OpenAI HealthBench, a comprehensive and challenging benchmark designed to assess the clinical relevance and validity of LLM responses. DeCode improves the previous state of the art from $28.4\%$ to $49.8\%$, a $75\%$ relative improvement. These results suggest that DeCode is effective at improving the clinical question answering of LLMs.