Planner-Auditor Twin: Agentic Discharge Planning with FHIR-Based LLM Planning, Guideline Recall, Optional Caching and Self-Improvement

Objective: Large language models (LLMs) show promise for clinical discharge planning, but their use is constrained by hallucination, omissions, and miscalibrated confidence. We introduce a self-improving, cache-optional Planner-Auditor framework that improves safety and reliability by decoupling generation from deterministic validation and targeted replay.

Materials and Methods: We implemented an agentic, retrospective, FHIR-native evaluation pipeline using MIMIC-IV-on-FHIR. For each patient, the Planner (an LLM) generates a structured discharge action plan with an explicit confidence estimate. The Auditor is a deterministic module that evaluates multi-task coverage, tracks calibration (Brier score and expected calibration error (ECE) proxies), and monitors action-distribution drift. The framework supports two-tier self-improvement: (i) within-episode regeneration, when enabled, and (ii) cross-episode discrepancy buffering with replay for high-confidence, low-coverage cases.

Results: Context caching improved performance over baseline, but the self-improvement loop was the primary driver of gains, increasing task coverage from 32% to 86%. Calibration improved substantially, with lower Brier score and ECE and fewer high-confidence misses. Discrepancy buffering further corrected persistent high-confidence omissions during replay.

Discussion: Feedback-driven regeneration and targeted replay act as effective control mechanisms that reduce omissions and improve confidence reliability in structured clinical planning. Separating an LLM Planner from a rule-based, observational Auditor enables systematic reliability measurement and safer iteration without model retraining.

Conclusion: The Planner-Auditor framework offers a practical pathway toward safer automated discharge planning using interoperable FHIR data access and deterministic auditing, supported by reproducible ablations and reliability-focused evaluation.
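The deterministic checks attributed to the Auditor can be illustrated with a minimal sketch. It is not the authors' implementation: the `PlanAudit` record, the task names, the thresholds (0.8 and 0.5), and the choice to score confidence against a binary "plan was adequate" outcome are all assumptions made for illustration. It shows task coverage, the Brier score, a binned ECE, and the selection of high-confidence, low-coverage cases for a replay buffer.

```python
from dataclasses import dataclass


@dataclass
class PlanAudit:
    """Audit record for one generated plan (hypothetical structure)."""
    patient_id: str
    confidence: float  # Planner's self-reported confidence in [0, 1]
    coverage: float    # fraction of required tasks present in the plan


def task_coverage(plan_tasks, required_tasks):
    """Fraction of required tasks that appear in the plan."""
    required = set(required_tasks)
    if not required:
        return 1.0
    return len(required & set(plan_tasks)) / len(required)


def brier_score(confidences, outcomes):
    """Mean squared gap between confidence and a binary outcome (0 or 1)."""
    pairs = list(zip(confidences, outcomes))
    return sum((c - o) ** 2 for c, o in pairs) / len(pairs)


def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Equal-width binned ECE: weighted mean of |avg confidence - avg outcome|."""
    bins = [[] for _ in range(n_bins)]
    for c, o in zip(confidences, outcomes):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, o))
    total = sum(len(b) for b in bins)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        avg_outcome = sum(o for _, o in b) / len(b)
        ece += (len(b) / total) * abs(avg_conf - avg_outcome)
    return ece


def select_for_replay(audits, conf_threshold=0.8, coverage_threshold=0.5):
    """Pick high-confidence, low-coverage cases (thresholds are assumed values)."""
    return [
        a for a in audits
        if a.confidence >= conf_threshold and a.coverage < coverage_threshold
    ]
```

In a sketch like this, every function is pure and rule-based, so repeated runs on the same plans give the same audit results, which is the property that makes the Auditor usable as a fixed reference while the Planner's outputs change across regeneration and replay.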
