Text-to-image (T2I) models have achieved remarkable progress in high-quality image synthesis, yet most benchmarks rely on simple, self-contained prompts and so fail to capture the complexity of real-world captions. Human-written captions often involve multiple interacting subjects, rich contextual references, and abstractive phrasing, conditions under which current image-text encoders such as CLIP struggle. To study these deficiencies systematically, we introduce ANCHOR, a large-scale dataset of 70K+ abstractive captions sourced from five major news media organizations. Analysis with ANCHOR reveals persistent failures in multi-subject understanding, contextual reasoning, and nuanced grounding. Motivated by these challenges, we propose Subject-Aware Fine-tuning (SAFE), which uses Large Language Models (LLMs) to extract key subjects and enhance their representation at the embedding level. Experiments with contemporary models show that SAFE significantly improves image-caption consistency and human preference alignment, making it a practical and scalable solution.