Recent years have witnessed remarkable progress in generative AI, with natural language emerging as the most common conditioning input. As underlying models grow more powerful, researchers are exploring increasingly diverse conditioning signals, such as depth maps, edge maps, camera parameters, and reference images, to give users finer control over generation. Among different modalities, sketches are a natural and long-standing form of human communication, enabling rapid expression of visual concepts. Previous literature has largely focused on edge maps, often misnamed 'sketches', yet algorithms that effectively handle true freehand sketches, with their inherent abstraction and distortions, remain underexplored. We pursue the challenging goal of balancing photorealism with sketch adherence when generating images from freehand input. A key obstacle is the absence of ground-truth pixel-aligned images: by their nature, freehand sketches do not have a single correct alignment. To address this, we propose a modulation-based approach that prioritizes semantic interpretation of the sketch over strict adherence to individual edge positions. We further introduce a novel loss that enables training on freehand sketches without requiring ground-truth pixel-aligned images. We show that our method outperforms existing approaches both in semantic alignment with freehand sketch inputs and in the realism and overall quality of the generated images.