Deep generative models have advanced text-to-online handwriting generation (TOHG), which aims to synthesize realistic pen trajectories conditioned on textual input and style references. However, most existing methods still operate at the character or word level, which makes them inefficient and unable to model holistic line-level structure when applied to full text lines. To address these issues, we propose DiffInk, the first latent diffusion Transformer framework for full-line handwriting generation. We first introduce InkVAE, a novel sequential variational autoencoder enhanced with two complementary latent-space regularization losses: (1) an OCR-based loss enforcing glyph-level accuracy, and (2) a style-classification loss preserving writer style. This dual regularization yields a semantically structured latent space in which character content and writer style are effectively disentangled. We then introduce InkDiT, a latent diffusion Transformer that integrates the target text and reference styles to generate coherent pen trajectories. Experimental results demonstrate that DiffInk outperforms state-of-the-art methods in both glyph accuracy and style fidelity while significantly improving generation efficiency.
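The dual regularization described above combines a reconstruction term and KL prior with two auxiliary penalties on the latent space. The following is a minimal sketch of such a composite objective, not the paper's actual implementation: the loss names, stand-in MSE reconstruction term, and weighting coefficients (`beta`, `lam_ocr`, `lam_style`) are all assumptions for illustration; in practice the OCR and style terms would come from trained recognizer and classifier heads applied to the latent sequence.

```python
import numpy as np

def kl_divergence(mu, logvar):
    # Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))

def inkvae_loss(recon, target, mu, logvar, ocr_nll, style_nll,
                beta=1.0, lam_ocr=0.1, lam_style=0.1):
    # Reconstruction term on the pen-trajectory sequence
    # (MSE used here as a simple stand-in)
    rec = np.mean((recon - target) ** 2)
    kl = kl_divergence(mu, logvar)
    # ocr_nll / style_nll: negative log-likelihoods from the
    # glyph recognizer and style classifier (hypothetical inputs here)
    return rec + beta * kl + lam_ocr * ocr_nll + lam_style * style_nll
```

With a perfect reconstruction and a posterior matching the prior, only the two auxiliary penalties contribute, which is what pushes the latent space to encode readable content and consistent style.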