Clinical natural language processing (NLP) models have shown promise for supporting hospital discharge planning by leveraging narrative clinical documentation. However, note-based models are particularly vulnerable to temporal and lexical leakage, in which documentation artifacts encode future clinical decisions and inflate apparent predictive performance. Such behavior poses substantial risks for real-world deployment, where overconfident or temporally invalid predictions can disrupt clinical workflows and compromise patient safety. This study focuses on the system-level design choices required to build safe, deployable clinical NLP systems under temporal leakage constraints. We present a lightweight auditing pipeline that integrates interpretability into the model development process to identify and suppress leakage-prone signals before final training. Using next-day discharge prediction after elective spine surgery as a case study, we evaluate how auditing affects predictive behavior, calibration, and safety-relevant trade-offs. Audited models exhibit more conservative and better-calibrated probability estimates, with reduced reliance on discharge-related lexical cues. These findings underscore that deployment-ready clinical NLP systems should prioritize temporal validity, calibration, and behavioral robustness over nominally optimistic performance metrics.
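The auditing pipeline can be pictured as a small train, inspect, scrub, retrain loop. The sketch below is a minimal, self-contained illustration of that idea under a bag-of-words assumption, not the paper's actual pipeline: the toy notes, the `leaky` cue lexicon, and the `scrub` helper are hypothetical stand-ins for the interpretability-driven review described above.

```python
# Minimal sketch of a lexical-leakage audit loop (train -> inspect ->
# scrub -> retrain). All data and the cue lexicon are hypothetical.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy pre-discharge notes (hypothetical); label 1 = discharged next day.
notes = [
    "pain controlled, ambulating with PT, discharge planned tomorrow",
    "afebrile, diet advanced, dc home in the morning per team",
    "persistent nausea, pain poorly controlled overnight",
    "new onset fever, holding mobilization, monitoring closely",
]
labels = np.array([1, 1, 0, 0])

def fit(texts, y):
    """Fit a bag-of-words logistic regression on the given notes."""
    vec = CountVectorizer()
    clf = LogisticRegression().fit(vec.fit_transform(texts), y)
    return vec, clf

# 1) Train an unaudited model and surface its strongest positive features.
vec, clf = fit(notes, labels)
ranked = sorted(zip(vec.get_feature_names_out(), clf.coef_[0]),
                key=lambda t: -t[1])
print("top cues:", ranked[:5])

# 2) Flag features that encode the future decision rather than patient
#    state; here a hand-curated lexicon stands in for clinician review.
leaky = {"discharge", "dc", "home", "tomorrow", "planned"}

def scrub(text):
    """Mask flagged cue tokens at the input level (model-agnostic)."""
    return " ".join(w for w in text.split()
                    if w.strip(",.").lower() not in leaky)

# 3) Retrain on the scrubbed notes.
vec_a, clf_a = fit([scrub(n) for n in notes], labels)

# On cue-laden text, the audited probability should fall back toward
# signals of actual patient state rather than the documented plan.
probe = "discharge planned tomorrow, pain controlled"
p_raw = clf.predict_proba(vec.transform([probe]))[0, 1]
p_aud = clf_a.predict_proba(vec_a.transform([scrub(probe)]))[0, 1]
print(f"unaudited p={p_raw:.2f}  audited p={p_aud:.2f}")
```

Scrubbing at the input rather than the weight level keeps the audit step independent of the model class, so in principle the same loop applies unchanged to transformer-based note encoders.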