The Slop Paradox: How Synthetic Standardization Erodes Clinical Uncertainty and Cross-Modal Alignment in AI-Rewritten Radiology Reports

AI-assisted clinical documentation tools increasingly summarize, standardize, and reformat radiology reports using large language models (LLMs). We present a controlled measurement of the resulting information degradation. Using 450 chest X-ray reports from the Indiana University dataset, we generate synthetic versions via three realistic LLM rewriting tasks: EHR summarization, standardized rewriting, and teaching case preparation. We measure entity erosion (via medical NER), hedging collapse (loss of clinical uncertainty language), and cross-modal alignment degradation (via BiomedCLIP image-text similarity). Our central finding is a dissociation between information loss and cross-modal fidelity. EHR summarization is the most destructive at the content level, eroding 51.4% of clinical entities and 43.7% of hedging language, yet it preserves image-text alignment almost entirely (a 2.5% drop). The two tasks meant to produce cleaner training data, standardized rewriting and teaching case preparation, do the reverse: they preserve more entities (26.8% and 29.3% eroded) but cause 14.9-16.5% alignment drops, six to seven times those of EHR summarization. We term this the slop paradox: rewriting that makes clinical text look cleaner for multimodal training is precisely what pulls it away from the image. Contrary to our pre-specified hypothesis, rare pathologies were not preferentially degraded: across nine rare-versus-common comparisons, no difference survived multiple-comparison correction, and nominal differences ran in the opposite direction (common > rare), so contamination is invisible to condition-specific monitoring. The dominant determinant of degradation is the type of AI rewriting task, not the clinical content. These findings bear on multimodal medical AI dataset construction and the governance of AI-assisted clinical documentation.
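The three quantities reported above can be illustrated with a minimal sketch. The abstract does not give the exact metric definitions, so everything here is an assumption made for illustration: the hedging lexicon `HEDGE_TERMS`, the set-based definition of entity erosion, the reading of an alignment "drop" as a relative change in image-text similarity, and all function names. The toy report text, entity sets and similarity values are likewise invented; only the percentages in the final ratio check come from the abstract.

```python
import re

# Hypothetical hedging lexicon (assumption): the abstract does not list the
# uncertainty terms the authors actually tracked.
HEDGE_TERMS = (
    "possible", "possibly", "probable", "likely", "may represent",
    "cannot be excluded", "suggestive of", "questionable",
)


def count_hedges(text: str) -> int:
    """Count whole-phrase occurrences of hedging terms, case-insensitively."""
    lowered = text.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
        for term in HEDGE_TERMS
    )


def hedging_collapse(original: str, rewritten: str) -> float:
    """Fraction of hedging expressions lost in the rewrite (0.0 if none existed)."""
    before = count_hedges(original)
    if before == 0:
        return 0.0
    return max(0.0, (before - count_hedges(rewritten)) / before)


def entity_erosion(original_entities: set, rewritten_entities: set) -> float:
    """Fraction of original entities missing from the rewrite (assumed definition)."""
    if not original_entities:
        return 0.0
    return len(original_entities - rewritten_entities) / len(original_entities)


def relative_drop(sim_original: float, sim_rewritten: float) -> float:
    """Relative decrease in image-text similarity (assumed definition of 'drop')."""
    return (sim_original - sim_rewritten) / sim_original


# Toy inputs, invented for illustration only.
original = "Possible right lower lobe pneumonia. Small effusion cannot be excluded."
rewritten = "Right lower lobe pneumonia. No effusion."

print(hedging_collapse(original, rewritten))
print(entity_erosion({"pneumonia", "effusion", "cardiomegaly", "atelectasis"},
                     {"pneumonia", "effusion"}))
print(relative_drop(0.40, 0.34))

# Ratio check using the percentages reported in the abstract.
print(round(14.9 / 2.5, 2), round(16.5 / 2.5, 2))
```

The final line checks the "six to seven times" comparison: the reported 14.9% and 16.5% alignment drops are roughly 5.96 and 6.6 times the 2.5% drop for EHR summarization. In a real pipeline the entity sets would come from a medical NER model and the similarities from BiomedCLIP embeddings; both are stubbed out here with hand-written values.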
