Large language models (LLMs) exhibit strong medical knowledge and can generate factually accurate responses. However, existing models often fail to account for individual patient contexts, producing answers that are clinically correct yet poorly aligned with patients' needs. In this work, we introduce DeCode, a training-free, model-agnostic framework that adapts existing LLMs to produce contextualized answers in clinical settings. We evaluate DeCode on OpenAI HealthBench, a comprehensive and challenging benchmark designed to assess the clinical relevance and validity of LLM responses. DeCode improves the previous state of the art from $28.4\%$ to $49.8\%$, a $75\%$ relative improvement. These results suggest that DeCode is effective at improving clinical question answering with LLMs.