Large language model (LLM) systems are increasingly used to support high-stakes decision-making, but they typically perform worse when the available evidence is internally inconsistent. This situation arises in real-world healthcare settings, where patient-reported symptoms can contradict clinical signs. To study this problem, we introduce MIMIC-DOS, a dataset for short-horizon prediction of organ dysfunction worsening in the intensive care unit (ICU). We derive the dataset from the widely recognized MIMIC-IV, a publicly available electronic health record dataset, and construct it exclusively from cases in which signs and symptoms are discordant. This setting poses a substantial challenge for existing LLM-based approaches: both single-pass LLMs and agentic pipelines often struggle to reconcile such conflicting signals. To address this problem, we propose CARE, a multi-stage, privacy-compliant agentic reasoning framework in which a remote LLM provides guidance by generating structured categories and transitions without accessing sensitive patient data, while a local LLM uses these categories and transitions to support evidence acquisition and final decision-making. Empirically, CARE outperforms multiple baseline settings on all key metrics, indicating that it handles conflicting clinical evidence more robustly while preserving privacy.
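The division of labor between the remote and local models can be illustrated with a minimal sketch. This is not the CARE implementation: the names `Guidance`, `remote_guidance`, and `local_decision`, the category labels, and the record fields are all hypothetical, and both LLM calls are replaced by plain Python stand-ins. The sketch only shows the privacy boundary described above, in which the remote side receives a task description and returns categories and transitions, while the patient record is read only on the local side.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Guidance:
    """Structured output of the remote model (hypothetical schema)."""
    categories: List[str]
    transitions: Dict[str, List[str]]  # category -> categories to inspect next


def remote_guidance(task_description: str) -> Guidance:
    """Stand-in for the remote LLM.

    It receives only a task description, never a patient record.
    The returned categories and transitions are hard-coded toy values.
    """
    return Guidance(
        categories=["symptoms", "vital_signs", "labs"],
        transitions={
            "symptoms": ["vital_signs"],
            "vital_signs": ["labs"],
            "labs": [],
        },
    )


def local_decision(
    record: Dict[str, str],
    guidance: Guidance,
    start: str,
    decide: Callable[[Dict[str, str]], object],
) -> object:
    """Stand-in for the local LLM.

    Follows the transitions to collect evidence from the record,
    then passes the collected evidence to a local decision function.
    """
    evidence: Dict[str, str] = {}
    frontier = [start]
    while frontier:
        category = frontier.pop(0)
        # Skip categories already visited or not proposed by the remote side.
        if category in evidence or category not in guidance.categories:
            continue
        evidence[category] = record.get(category, "")
        frontier.extend(guidance.transitions.get(category, []))
    return decide(evidence)
```

In this sketch, a caller would obtain `guidance` once from `remote_guidance(...)` and then call `local_decision(record, guidance, "symptoms", decide)` for each record, where `decide` is whatever local model or rule produces the final prediction. Fields of the record that fall outside the proposed categories are never read.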