Recent text-to-image models can generate high-quality images from natural-language prompts, yet controlling typography remains challenging: requested typographic appearance is often ignored or only weakly followed. We address this limitation with a data-centric approach that trains image generation models using targeted supervision derived from a structured annotation pipeline specialized for typography. Our pipeline constructs a large-scale typography-focused dataset, FontUse, consisting of about 70K images annotated with user-friendly prompts, text-region locations, and OCR-recognized strings. The annotations are automatically produced using segmentation models and multimodal large language models (MLLMs). The prompts explicitly combine font styles (e.g., serif, script, elegant) and use cases (e.g., wedding invitations, coffee-shop menus), enabling intuitive specification even for novice users. Fine-tuning existing generators with these annotations allows them to consistently interpret style and use-case conditions as textual prompts without architectural modification. For evaluation, we introduce a Long-CLIP-based metric that measures alignment between generated typography and requested attributes. Experiments across diverse prompts and layouts show that models trained with our pipeline produce text renderings more consistent with prompts than competitive baselines. The source code for our annotation pipeline is available at https://github.com/xiaxinz/FontUSE.
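At its core, a CLIP-style alignment metric like the one described above scores a generated image by the cosine similarity between its embedding and the embedding of the requested typographic prompt. The sketch below is only illustrative: the function name and the pure-Python similarity computation are our assumptions, standing in for the paper's Long-CLIP encoders, whose embeddings would be plugged in where the toy vectors appear.

```python
import math

def alignment_score(image_emb, text_emb):
    """Cosine similarity between an image embedding and a prompt embedding.

    In the actual metric, both vectors would come from a (Long-)CLIP image
    encoder and text encoder; here they are plain lists of floats.
    """
    dot = sum(a * b for a, b in zip(image_emb, text_emb))
    norm_img = math.sqrt(sum(a * a for a in image_emb))
    norm_txt = math.sqrt(sum(b * b for b in text_emb))
    return dot / (norm_img * norm_txt)

# Toy embeddings: identical directions score 1.0, orthogonal ones 0.0.
print(alignment_score([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(alignment_score([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

A higher score indicates that the generated typography better matches the requested style and use-case attributes encoded in the prompt.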