Media design layer generation enables the creation of fully editable, layered design documents such as posters, flyers, and logos using only natural language prompts. Existing methods either restrict outputs to a fixed number of layers or require each layer to contain only spatially continuous regions, causing the layer count to scale linearly with design complexity. We propose LaDe (Layered Media Design), a latent diffusion framework that generates a flexible number of semantically meaningful layers. LaDe combines three components: an LLM-based prompt expander that transforms a short user intent into structured per-layer descriptions that guide the generation, a Latent Diffusion Transformer with a 4D RoPE positional encoding mechanism that jointly generates the full media design and its constituent RGBA layers, and an RGBA VAE that decodes each layer with full alpha-channel support. By conditioning on layer samples during training, our unified framework supports three tasks: text-to-image generation, text-to-layers media design generation, and media design decomposition. We compare LaDe to Qwen-Image-Layered on text-to-layers and image-to-layers tasks on the Crello test set. LaDe outperforms Qwen-Image-Layered in text-to-layers generation by improving text-to-layer alignment, as validated by two VLM-as-a-judge evaluators (GPT-4o mini and Qwen3-VL).
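To make the relationship between the constituent RGBA layers and the full media design concrete, the following is a minimal sketch of how a stack of RGBA layers can be flattened into a single image. It uses the standard Porter-Duff "over" operator with straight (non-premultiplied) alpha. This is an assumption for illustration only: the abstract does not state how LaDe composites its layers, and the names `over` and `flatten_layers` are hypothetical.

```python
from functools import reduce


def over(top, bottom):
    """Porter-Duff 'over' for two straight-alpha RGBA pixels with values in [0, 1]."""
    tr, tg, tb, ta = top
    br, bg, bb, ba = bottom
    out_a = ta + ba * (1.0 - ta)
    if out_a == 0.0:
        # Both pixels are fully transparent; colour is undefined, so return zeros.
        return (0.0, 0.0, 0.0, 0.0)

    def blend(t, b):
        # Weight each colour by its effective coverage, then un-premultiply.
        return (t * ta + b * ba * (1.0 - ta)) / out_a

    return (blend(tr, br), blend(tg, bg), blend(tb, bb), out_a)


def flatten_layers(layers):
    """Flatten a list of equally sized images, ordered bottom to top.

    Each image is a list of rows, and each row is a list of RGBA tuples.
    (Assumed layout for this sketch, not a format taken from the paper.)
    """
    def merge(bottom_img, top_img):
        return [
            [over(t, b) for t, b in zip(top_row, bottom_row)]
            for top_row, bottom_row in zip(top_img, bottom_img)
        ]

    return reduce(merge, layers)


# Usage: an opaque red background under a half-transparent blue layer (1x1 images).
background = [[(1.0, 0.0, 0.0, 1.0)]]
overlay = [[(0.0, 0.0, 1.0, 0.5)]]
flat = flatten_layers([background, overlay])
```

Because each layer keeps its own alpha channel, a layer can be moved, restyled, or removed and the design re-flattened without touching the others, which is what makes a layered output editable in practice.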