TextAlign: Preference Alignment for Text Rendering with Hierarchical Rewards

Faithful text rendering remains a persistent weakness of large text-to-image generative models, because it requires both semantic instruction following and fine-grained glyph-level structure. Prior methods often improve this ability through architecture-specific modules or encoder modifications, which complicate deployment across foundation models. We study text rendering as a post-training preference-alignment problem and propose TextAlign, a non-invasive framework that keeps the generator architecture unchanged. The key component is a hierarchical reward based on a vision-language model (VLM) that decomposes rendering errors into global, word, and glyph levels, then converts binary defect judgments into a scalar preference signal. The resulting signal supports both Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO). Experiments on FLUX.1-dev and Z-Image-Turbo show consistent gains in OCR-based text accuracy without degrading general generation quality. Comparisons with strong foundation and text-rendering baselines, including SD3.5, Qwen-Image, AnyText, and TextDiffuser, indicate that reward design offers a scalable alternative to model redesign for improving text rendering.
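
As a rough illustration of how binary defect judgments at three levels could be turned into a scalar signal usable by both GRPO and DPO, the sketch below may help. It is not the paper's implementation: the abstract does not specify the aggregation rule, so the level weights, the defect-rate averaging, and all names (`DefectJudgments`, `scalar_reward`, `group_relative_advantages`, `preference_pair`) are assumptions made for illustration. The VLM judge itself is not modeled; its yes/no outputs are taken as given.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

# Hypothetical level weights; the abstract does not state the actual values.
LEVEL_WEIGHTS = {"global": 0.2, "word": 0.3, "glyph": 0.5}


@dataclass
class DefectJudgments:
    """Binary defect flags for one image (True means a defect was flagged)."""
    global_defects: list[bool]
    word_defects: list[bool]
    glyph_defects: list[bool]


def scalar_reward(j: DefectJudgments) -> float:
    """Weighted share of defect-free checks per level, in [0, 1] (assumed rule)."""
    levels = {
        "global": j.global_defects,
        "word": j.word_defects,
        "glyph": j.glyph_defects,
    }
    reward = 0.0
    for name, flags in levels.items():
        # A level with no checks is treated as defect-free.
        defect_rate = sum(flags) / len(flags) if flags else 0.0
        reward += LEVEL_WEIGHTS[name] * (1.0 - defect_rate)
    return reward


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages: standardize rewards within one prompt's sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def preference_pair(rewards: list[float]) -> tuple[int, int]:
    """DPO-style pair: indices of the (chosen, rejected) samples by reward."""
    chosen = max(range(len(rewards)), key=rewards.__getitem__)
    rejected = min(range(len(rewards)), key=rewards.__getitem__)
    return chosen, rejected


# Three samples for the same prompt, with made-up judgments.
group = [
    DefectJudgments([False], [False, False], [False, False, False, False]),
    DefectJudgments([False], [True, False], [True, False, False, False]),
    DefectJudgments([True], [True, True], [True, True, False, False]),
]
rewards = [scalar_reward(j) for j in group]
advantages = group_relative_advantages(rewards)
chosen, rejected = preference_pair(rewards)
```

The same scalar feeds both paths: GRPO consumes the standardized advantages directly, while DPO only needs the ordering to pick a chosen and a rejected sample. Weighting the glyph level most heavily is an arbitrary choice here, made only to show that the levels can contribute unequally.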

