Structured benchmarks have advanced text-conditional image generation for real-world imagery; however, no such benchmark exists for synthetic radiograph generation. Although this is a highly active area of research, existing studies continue to adopt inconsistent evaluation protocols and lack a unified assessment of the three most critical criteria: generative fidelity, privacy risk, and downstream utility. To address these limitations, we introduce CheXGenBench, the first unified evaluation framework for synthetic chest radiograph generation that simultaneously assesses fidelity, privacy risks, and downstream utility across frontier text-to-image (T2I) generative models. Our evaluation protocol, comprising over 20 quantitative metrics, covers 11 leading T2I architectures and supports plug-and-play integration of newer models. Through this rigorous and fair protocol, we establish comprehensive state-of-the-art (SoTA) baselines across all dimensions to guide future research. Furthermore, our results uncover several limitations of current generative models: first, even SoTA models struggle with long-tailed medical distributions; second, models pose high privacy risks regardless of their fidelity; and third, while synthetic data already benefits downstream classification, it is of limited utility for downstream multimodal tasks. Drawing on these results, we propose concrete research directions to advance the field. The code is available at https://github.com/Raman1121/CheXGenBench