Large-scale text-to-image (T2I) diffusion models excel at open-domain synthesis but still struggle with precise text rendering, especially for multi-line layouts, dense typography, and long-tailed scripts such as Chinese. Prior solutions typically require costly retraining or rigid external layout constraints, which can degrade aesthetics and limit flexibility. We propose \textbf{FreeText}, a training-free, plug-and-play framework that improves text rendering by exploiting intrinsic mechanisms of \emph{Diffusion Transformer (DiT)} models. \textbf{FreeText} decomposes the problem into \emph{where to write} and \emph{what to write}. For \emph{where to write}, we localize writing regions by reading token-wise spatial attribution from endogenous image-to-text attention, using sink-like tokens as stable spatial anchors and topology-aware refinement to produce high-confidence masks. For \emph{what to write}, we introduce Spectral-Modulated Glyph Injection (SMGI), which injects a noise-aligned glyph prior with frequency-domain band-pass modulation to strengthen glyph structure and suppress semantic leakage (rendering the concept instead of the word). Extensive experiments on Qwen-Image, FLUX.1-dev, and SD3 variants across longText-Benchmark, CVTG, and our CLT-Bench show consistent gains in text readability while largely preserving semantic alignment and aesthetic quality, with modest inference overhead.
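The \emph{where to write} step can be illustrated with a minimal NumPy sketch: average the image-to-text attention columns belonging to the quoted text tokens, threshold the resulting heatmap, and keep the largest connected component as a crude stand-in for the paper's topology-aware refinement. The function name, the percentile threshold, and the connected-component heuristic are all illustrative assumptions, not the actual FreeText algorithm.

```python
import numpy as np

def text_region_mask(attn, text_idx, hw, q=90):
    """Derive a writing-region mask from image-to-text attention.

    attn:     (num_image_tokens, num_text_tokens) attention weights
    text_idx: indices of prompt tokens carrying the quoted text
    hw:       (H, W) of the image-token grid
    q:        percentile used to binarize the heatmap (assumed value)

    Averages the selected attention columns, thresholds at the q-th
    percentile, and keeps the largest 4-connected component as a
    simple topology-aware cleanup of spurious activations.
    """
    h, w = hw
    heat = attn[:, text_idx].mean(axis=1).reshape(h, w)
    binary = heat >= np.percentile(heat, q)

    # keep only the largest 4-connected component (flood fill)
    seen = np.zeros_like(binary, dtype=bool)
    best = np.zeros_like(binary, dtype=bool)
    for sy in range(h):
        for sx in range(w):
            if binary[sy, sx] and not seen[sy, sx]:
                comp, stack = [], [(sy, sx)]
                seen[sy, sx] = True
                while stack:
                    y, x = stack.pop()
                    comp.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and binary[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            stack.append((ny, nx))
                if len(comp) > best.sum():
                    cur = np.zeros_like(binary, dtype=bool)
                    cur[tuple(zip(*comp))] = True
                    best = cur
    return best
```

In a real DiT pipeline the attention maps would come from the model's cross-attention layers at a chosen denoising step; here they are treated as a given array for clarity.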
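The \emph{what to write} step (frequency-domain band-pass modulation of a noise-aligned glyph prior) can likewise be sketched in a few lines: transform the glyph prior with a 2-D FFT, zero out radial frequencies outside a pass band, then rescale the filtered glyph to the latent's statistics before blending it in. The function names, the radial mask, and the band limits are assumptions for illustration, not the paper's implementation of SMGI.

```python
import numpy as np

def bandpass_modulate(glyph, low=0.05, high=0.45):
    """Band-pass filter a 2-D glyph prior in the frequency domain.

    Radial frequencies (normalized per-axis) outside [low, high] are
    removed, keeping stroke-scale structure while discarding the DC /
    low-frequency envelope (a source of semantic leakage) and
    high-frequency noise. Band limits are illustrative assumptions.
    """
    F = np.fft.fftshift(np.fft.fft2(glyph))
    h, w = glyph.shape
    yy, xx = np.mgrid[-(h // 2):(h + 1) // 2, -(w // 2):(w + 1) // 2]
    r = np.sqrt((yy / h) ** 2 + (xx / w) ** 2)  # normalized radius
    mask = (r >= low) & (r <= high)
    return np.fft.ifft2(np.fft.ifftshift(F * mask)).real

def inject_glyph(latent, glyph, strength=0.3):
    """Noise-aligned injection: match the filtered glyph's scale to the
    latent before adding it with a small strength (assumed value)."""
    g = bandpass_modulate(glyph)
    g = (g - g.mean()) / (g.std() + 1e-8) * latent.std()
    return latent + strength * g
```

Removing the DC band means the injected signal carries glyph structure rather than a bright "text-shaped" region, which is one plausible way to read the claim that band-pass modulation suppresses semantic leakage.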