Evaluating the Clinical Impact of Generative Inpainting on Bone Age Estimation

Felipe Akio Matsuoka, Eduardo Moreno J. M. Farina, Augusto Sarquis Serpa, Soraya Monteiro, Rodrigo Ragazzini, Nitamar Abdala, Marcelo Straus Takahashi, Felipe Campos Kitamura

From arXiv; 8 pages, 4 figures

Generative foundation models can remove visual artifacts through realistic image inpainting, but their impact on medical AI performance remains uncertain. Pediatric hand radiographs often contain non-anatomical markers, and it is unclear whether inpainting these regions preserves features needed for bone age and gender prediction. To evaluate the clinical reliability of generative model-based inpainting for artifact removal, we used the RSNA Bone Age Challenge dataset, selecting 200 original radiographs and generating 600 inpainted versions with gpt-image-1 using natural language prompts to target non-anatomical artifacts. Downstream performance was assessed with deep learning ensembles for bone age estimation and gender classification, using mean absolute error (MAE) and area under the ROC curve (AUC) as metrics, and pixel intensity distributions to detect structural alterations. Inpainting markedly degraded model performance: bone age MAE increased from 6.26 to 30.11 months, and gender classification AUC decreased from 0.955 to 0.704. Inpainted images displayed pixel-intensity shifts and inconsistencies, indicating structural modifications not corrected by simple calibration. These findings show that, although visually realistic, foundation model-based inpainting can obscure subtle but clinically relevant features and introduce latent bias even when edits are confined to non-diagnostic regions, underscoring the need for rigorous, task-specific validation before integrating such generative tools into clinical AI workflows.
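The two downstream metrics named in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names `mean_absolute_error` and `roc_auc` are hypothetical, and the toy predictions are invented for illustration only and are not data from the study. MAE is the average absolute gap between reference and predicted bone age (in months), and AUC is computed here as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as one half.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between reference and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC via pairwise comparison: P(score of positive > score of negative)."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy values (assumed, NOT from the paper): reference bone ages in months.
ages_true = [120.0, 96.0, 150.0, 60.0]
pred_original = [124.0, 90.0, 155.0, 63.0]    # predictions on original images
pred_inpainted = [150.0, 70.0, 120.0, 90.0]   # predictions on inpainted images

# Toy binary labels and classifier scores (assumed, NOT from the paper).
labels = [1, 0, 1, 0]
scores_original = [0.9, 0.2, 0.8, 0.3]
scores_inpainted = [0.6, 0.5, 0.4, 0.7]

mae_original = mean_absolute_error(ages_true, pred_original)
mae_inpainted = mean_absolute_error(ages_true, pred_inpainted)
auc_original = roc_auc(labels, scores_original)
auc_inpainted = roc_auc(labels, scores_inpainted)

print(f"MAE original={mae_original:.2f}, inpainted={mae_inpainted:.2f}")
print(f"AUC original={auc_original:.3f}, inpainted={auc_inpainted:.3f}")
```

Comparing each metric on the original images against the same metric on their inpainted counterparts, as sketched here, is one plain way to express the kind of degradation the abstract reports. The pairwise AUC is quadratic in the number of cases, which is acceptable for a few hundred images; a rank-based implementation would be preferable for larger sets.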
