We introduce, to the best of our knowledge, the first method for adapting image-to-video models to layer-aware text (glyph) animation, a capability critical for practical dynamic visual design. Existing approaches predominantly treat transparency (the alpha channel) as an extra latent dimension appended to the RGB space, which requires retraining the underlying RGB-centric variational autoencoder (VAE). However, retraining the VAE is computationally expensive and, given the scarcity of high-quality transparent glyph data, may erode the robust semantic priors learned from massive RGB corpora, potentially leading to latent pattern mixing. To address these limitations, we propose TransText, a framework built on a novel Alpha-as-RGB paradigm that jointly models appearance and transparency without modifying the pre-trained generative manifold. TransText embeds the alpha channel as an RGB-compatible visual signal through latent spatial concatenation, explicitly enforcing strict cross-modal (RGB and alpha) consistency while preventing feature entanglement. Our experiments show that TransText significantly outperforms baselines, generating coherent, high-fidelity transparent animations with diverse, fine-grained effects.
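As a rough illustration of the Alpha-as-RGB idea, the sketch below replicates a single-channel alpha video into three channels so that an unmodified RGB encoder can process it, then places the RGB and alpha latents side by side along a spatial axis. It is a minimal sketch under stated assumptions, not the TransText implementation: `toy_encode` is a hypothetical average-pooling stand-in for a frozen RGB VAE encoder, the helper names are invented for illustration, and the choice of the width axis for concatenation is an assumption.

```python
import numpy as np


def alpha_as_rgb(alpha):
    """Replicate an alpha video (T, H, W) into an RGB-like video (T, H, W, 3)."""
    return np.repeat(alpha[..., None], 3, axis=-1)


def toy_encode(video, factor=8):
    """Hypothetical stand-in for a frozen RGB VAE encoder.

    Average-pools each frame by `factor`; a real encoder would be a
    pre-trained network whose weights stay untouched.
    """
    t, h, w, c = video.shape
    blocks = video.reshape(t, h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(2, 4))


def pack_latents(rgb, alpha, encode=toy_encode):
    """Encode RGB and alpha with the same encoder, then concatenate spatially."""
    z_rgb = encode(rgb)
    z_alpha = encode(alpha_as_rgb(alpha))
    # Assumption: concatenate along the width axis, keeping the channel
    # dimension exactly as the RGB encoder produced it.
    return np.concatenate([z_rgb, z_alpha], axis=2)


def unpack_latents(z):
    """Split a packed latent back into its RGB and alpha halves."""
    half = z.shape[2] // 2
    return z[:, :, :half], z[:, :, half:]


rgb = np.random.rand(4, 64, 64, 3)   # 4 frames of 64x64 RGB
alpha = np.random.rand(4, 64, 64)    # matching alpha mattes
z = pack_latents(rgb, alpha)
z_rgb, z_alpha = unpack_latents(z)
```

Because the alpha latent occupies its own spatial region rather than extra channels, the latent channel count is unchanged, which is what lets the pre-trained encoder and generator be reused as-is in this sketch.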