Recent advances in pixel-based text modeling show that representing text as images lets models exploit visual cues for language understanding. Because the text is grounded in its visual form, structurally similar characters with different Unicode code points produce similar embeddings, which benefits cross-lingual and zero-shot scenarios. Conventional text-based approaches instead treat each character as an independent symbol, which limits generalization to unseen characters and requires expanding the embedding matrix during cross-lingual adaptation. We propose Pixel-TTS, the first framework for visually grounded speech synthesis. It renders text as images and projects them through a 2D convolutional layer to generate embeddings. This design eliminates embedding matrix expansion during fine-tuning while improving robustness to unseen characters and orthographic variations. Extensive experiments show that Pixel-TTS performs competitively with strong baselines while converging faster and generalizing robustly in zero-shot settings.
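The render-then-convolve idea can be illustrated with a minimal sketch. This is not the Pixel-TTS implementation. The 5x3 bitmap glyphs, the helper names (`render`, `make_kernels`, `patch_embed`), the embedding size, and the choice of a stride equal to one glyph cell are all assumptions made for illustration, and the random kernels stand in for learned weights.

```python
import random

# Hypothetical 5x3 bitmap glyphs standing in for a real font rasterizer.
# Latin "A" (U+0041) and Cyrillic "А" (U+0410) deliberately share one shape.
GLYPH_A = ["010", "101", "111", "101", "101"]
GLYPH_B = ["110", "101", "110", "101", "110"]
GLYPHS = {"A": GLYPH_A, "\u0410": GLYPH_A, "B": GLYPH_B}
GLYPH_H, GLYPH_W = 5, 3


def render(text):
    """Rasterize text into a GLYPH_H x (GLYPH_W * len(text)) grid of 0/1 pixels."""
    rows = [[] for _ in range(GLYPH_H)]
    for ch in text:
        glyph = GLYPHS[ch]  # toy font: only the three glyphs above are defined
        for r in range(GLYPH_H):
            rows[r].extend(float(p) for p in glyph[r])
    return rows


def make_kernels(dim, seed=0):
    """Create `dim` random GLYPH_H x GLYPH_W kernels (stand-ins for learned weights)."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-1, 1) for _ in range(GLYPH_W)] for _ in range(GLYPH_H)]
        for _ in range(dim)
    ]


def patch_embed(image, kernels):
    """2D convolution with kernel size and stride equal to one glyph cell.

    Returns one embedding (a list of len(kernels) floats) per glyph-wide patch.
    """
    width = len(image[0])
    embeddings = []
    for x0 in range(0, width - GLYPH_W + 1, GLYPH_W):
        vec = []
        for k in kernels:
            acc = 0.0
            for r in range(GLYPH_H):
                for c in range(GLYPH_W):
                    acc += k[r][c] * image[r][x0 + c]
            vec.append(acc)
        embeddings.append(vec)
    return embeddings


kernels = make_kernels(dim=4)
latin, cyrillic, other = patch_embed(render("A\u0410B"), kernels)
print(latin == cyrillic)  # the two "A" glyphs have the same pixels, hence the same embedding
print(latin == other)     # "B" has different pixels, hence a different embedding
```

In this sketch the kernel shapes depend only on the patch size and the embedding dimension, not on how many distinct characters exist, so supporting a new script would mean rendering new glyphs rather than adding rows to an embedding table.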