Generative artificial intelligence (AI) is rapidly populating medical records with synthetic content, creating a feedback loop in which future models are increasingly at risk of training on uncurated AI-generated data. However, the clinical consequences of this AI-generated data contamination remain unexplored. Here, we show that in the absence of mandatory human verification, this self-referential cycle drives a rapid erosion of pathological variability and diagnostic reliability. By analysing more than 800,000 synthetic data points across clinical text generation, vision-language reporting, and medical image synthesis, we find that models progressively converge toward generic phenotypes regardless of model architecture. Specifically, rare but critical findings, including pneumothorax and effusions, vanish from AI-generated synthetic content, while demographic representations skew heavily toward middle-aged male phenotypes. Crucially, this degradation is masked by false diagnostic confidence: models continue to issue reassuring reports while failing to detect life-threatening pathology, with false reassurance rates tripling to 40%. Blinded physician evaluation confirms that this decoupling of confidence and accuracy renders AI-generated documentation clinically useless after just two generations. We systematically evaluate three mitigation strategies, finding that while scaling synthetic data volume fails to prevent collapse, mixing in real data with quality-aware filtering effectively preserves diversity. Ultimately, our results suggest that without policy-mandated human oversight, the deployment of generative AI threatens to degrade the very healthcare data ecosystems it relies upon.