Text-to-video generation has advanced rapidly, but existing methods typically output only the final composited video and lack editable layered representations, limiting their use in professional workflows. We propose \textbf{LayerT2V}, a unified multi-layer video generation framework that produces multiple semantically consistent outputs in a single inference pass: the full video, an independent background layer, and multiple foreground RGB layers with corresponding alpha mattes. Our key insight is that recent video generation backbones use high compression in both time and space, enabling us to serialize multiple layer representations along the temporal dimension and jointly model them on a shared generation trajectory. This turns cross-layer consistency into an intrinsic objective, improving semantic alignment and temporal coherence. To mitigate layer ambiguity and conditional leakage, we augment a shared DiT backbone with LayerAdaLN and layer-aware cross-attention modulation. LayerT2V is trained in three stages: alpha mask VAE adaptation, joint multi-layer learning, and multi-foreground extension. We also introduce \textbf{VidLayer}, the first large-scale dataset for multi-layer video generation. Extensive experiments demonstrate that LayerT2V substantially outperforms prior methods in visual fidelity, temporal consistency, and cross-layer coherence.
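As an illustrative sketch of the temporal serialization described above (the notation below is ours and not fixed by the abstract): let $z_{\mathrm{full}},\, z_{\mathrm{bg}},\, z^{(k)}_{\mathrm{fg}},\, z^{(k)}_{\alpha} \in \mathbb{R}^{T' \times H' \times W' \times C}$ denote the compressed latents of the composite video, the background layer, and the $k$-th foreground RGB layer with its alpha matte ($k = 1, \dots, K$). Concatenating them along the latent temporal axis yields a single sequence
\[
z \;=\; \big[\, z_{\mathrm{full}} \,\|\, z_{\mathrm{bg}} \,\|\, z^{(1)}_{\mathrm{fg}} \,\|\, z^{(1)}_{\alpha} \,\|\, \cdots \,\|\, z^{(K)}_{\mathrm{fg}} \,\|\, z^{(K)}_{\alpha} \,\big] \;\in\; \mathbb{R}^{(2 + 2K)\,T' \times H' \times W' \times C},
\]
which the shared DiT backbone denoises jointly, so that all layers evolve along one generation trajectory and cross-layer consistency becomes an intrinsic property of the sampling process.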