Objective: Clinical documentation contains factual, diagnostic, and management errors that can compromise patient safety. Large language models (LLMs) may help detect and correct such errors, but their behavior under different prompting strategies remains unclear. We evaluate zero-shot prompting, static prompting with random exemplars (SPR), and retrieval-augmented dynamic prompting (RDP) on three subtasks of medical error processing: error flag detection, error sentence detection, and error correction.

Methods: Using the MEDEC dataset, we evaluated nine instruction-tuned LLMs (GPT, Claude, Gemini, and OpenAI o-series models). We measured performance using accuracy, recall, false-positive rate (FPR), and an aggregate score of ROUGE-1, BLEURT, and BERTScore for error correction. We also analyzed example outputs to identify failure modes and differences between LLM and clinician reasoning.

Results: Zero-shot prompting showed low recall in both detection tasks, often missing abbreviation-heavy or atypical errors. SPR improved recall but increased FPR. Across all nine LLMs, RDP reduced FPR by about 15 percent, improved recall by 5 to 10 percent in error sentence detection, and generated more contextually accurate corrections.

Conclusion: Across diverse LLMs, RDP outperforms zero-shot and SPR prompting. Using retrieved exemplars improves detection accuracy, reduces false positives, and enhances the reliability of medical error correction.
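The core idea of RDP, as contrasted with SPR above, is that the few-shot exemplars are retrieved per input rather than fixed at random. A minimal sketch of this prompt-construction step follows; the token-overlap cosine retriever, the exemplar fields (`note`, `flag`, `correction`), and the toy exemplar pool are illustrative assumptions, not the paper's actual retrieval model or MEDEC records.

```python
# Sketch of retrieval-augmented dynamic prompting (RDP): for each input
# clinical note, retrieve the most similar labeled exemplars and prepend
# them to the prompt, instead of using a fixed random set (SPR).
from collections import Counter
import math

def tokenize(text: str) -> Counter:
    # Toy bag-of-words representation; a real system would use a
    # trained embedding model here.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_rdp_prompt(note: str, exemplars: list[dict], k: int = 2) -> str:
    """Retrieve the k exemplars most similar to `note` and format a
    few-shot prompt for the error-flag detection subtask."""
    q = tokenize(note)
    ranked = sorted(exemplars,
                    key=lambda e: cosine(q, tokenize(e["note"])),
                    reverse=True)
    shots = "\n\n".join(
        f"Note: {e['note']}\nError flag: {e['flag']}\n"
        f"Correction: {e['correction']}"
        for e in ranked[:k]
    )
    return f"{shots}\n\nNote: {note}\nError flag:"

# Hypothetical exemplar pool, for illustration only.
pool = [
    {"note": "Patient given amoxicillin for viral URI.", "flag": 1,
     "correction": "Antibiotics are not indicated for a viral URI."},
    {"note": "BP 120/80, patient in no acute distress.", "flag": 0,
     "correction": "NA"},
]
prompt = build_rdp_prompt("Prescribed amoxicillin for a viral infection.",
                          pool, k=1)
```

Under zero-shot prompting the `shots` prefix is simply omitted; under SPR the exemplars are sampled once and reused for every note.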

