Text-to-Image Diffusion Models Cannot Count, and Prompt Refinement Cannot Help

Generative modeling is widely regarded as one of the central problems in today's AI community, and text-to-image generation has had unprecedented real-world impact. Among the various approaches, diffusion models have achieved remarkable success and have become the de facto solution for text-to-image generation. Despite their impressive performance, however, these models exhibit fundamental limitations in adhering to numerical constraints in user instructions and frequently generate images with an incorrect number of objects. Although several prior works have mentioned this issue, a comprehensive and rigorous evaluation of the limitation is still lacking. To address this gap, we introduce T2ICountBench, a novel benchmark designed to rigorously evaluate the counting ability of state-of-the-art text-to-image diffusion models. Our benchmark covers a diverse set of generative models, including both open-source and proprietary systems. It explicitly isolates counting performance from other capabilities, provides structured difficulty levels, and incorporates human evaluations to ensure high reliability. Extensive evaluations with T2ICountBench reveal that all evaluated state-of-the-art diffusion models fail to reliably generate the correct number of objects, with accuracy dropping significantly as the number of objects increases. Additionally, an exploratory study on prompt refinement shows that such simple interventions generally do not improve counting accuracy. Our findings highlight the inherent challenges of numerical understanding in diffusion models and point to promising directions for future improvements.
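To make the evaluation idea concrete, the following is a minimal sketch of how exact-match counting accuracy could be aggregated per model and difficulty level from human annotations. It is an illustration under stated assumptions, not the benchmark's actual scoring code: the function names, the record layout, the difficulty thresholds, and the sample annotations are all hypothetical.

```python
from collections import defaultdict


def difficulty_level(requested: int) -> str:
    """Map a requested object count to a difficulty bin.

    The thresholds are hypothetical; the benchmark's real level
    boundaries are not stated in this abstract.
    """
    if requested <= 3:
        return "easy"
    if requested <= 7:
        return "medium"
    return "hard"


def counting_accuracy(records):
    """Return {(model, level): fraction of images with the exact requested count}.

    Each record is a (model, requested_count, human_counted) tuple.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for model, requested, observed in records:
        key = (model, difficulty_level(requested))
        totals[key] += 1
        hits[key] += int(observed == requested)
    return {key: hits[key] / totals[key] for key in totals}


# Made-up annotations for illustration only: (model, requested, human-counted).
records = [
    ("model_a", 2, 2),
    ("model_a", 3, 3),
    ("model_a", 5, 4),
    ("model_a", 6, 6),
    ("model_a", 9, 12),
    ("model_a", 10, 8),
]
accuracy = counting_accuracy(records)
```

Grouping by the requested count rather than pooling all prompts is what would expose the trend described above, namely accuracy falling as the number of objects grows.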
