Sketches offer designers a concise yet expressive medium for early-stage fashion ideation, specifying structure, silhouette, and spatial relationships, while textual descriptions complement sketches by conveying material, color, and stylistic details. Effectively combining the textual and visual modalities requires adhering to the sketch's visual structure while leveraging the guidance of localized attributes from text. We present LOcalized Text and Sketch with multi-level guidance (LOTS), a framework that enhances fashion image generation by combining global sketch guidance with multiple localized sketch-text pairs. LOTS employs a Multi-level Conditioning Stage to independently encode local features within a shared latent space while maintaining global structural coordination. The Diffusion Pair Guidance stage then integrates both local and global conditioning via attention-based guidance within the diffusion model's multi-step denoising process. To validate our method, we develop Sketchy, the first fashion dataset in which multiple text-sketch pairs are provided per image. Sketchy offers high-quality, clean sketches with a professional look and consistent structure. To assess robustness beyond this setting, we also include an "in the wild" split of non-expert sketches, featuring higher variability and imperfections. Experiments demonstrate that our method strengthens global structural adherence while leveraging richer localized semantic guidance, achieving improvements over the state of the art. The dataset, platform, and code are publicly available.