VideoSketcher: Video Models Prior Enable Versatile Sequential Sketch Generation

Sketching is inherently a sequential process, in which strokes are drawn in a meaningful order to explore and refine ideas. However, most generative models treat sketches as static images, overlooking the temporal structure that underlies creative drawing. We present a data-efficient approach for sequential sketch generation that adapts pretrained text-to-video diffusion models to generate sketching processes. Our key insight is that large language models and video diffusion models offer complementary strengths for this task: LLMs provide semantic planning and stroke ordering, while video diffusion models serve as strong renderers that produce high-quality, temporally coherent visuals. We leverage this by representing sketches as short videos in which strokes are progressively drawn on a blank canvas, guided by text-specified ordering instructions. We introduce a two-stage fine-tuning strategy that decouples the learning of stroke ordering from the learning of sketch appearance. Stroke ordering is learned using synthetic shape compositions with controlled temporal structure, while visual appearance is distilled from as few as seven manually authored sketching processes that capture both global drawing order and the continuous formation of individual strokes. Despite the extremely limited amount of human-drawn sketch data, our method generates high-quality sequential sketches that closely follow text-specified orderings while exhibiting rich visual detail. We further demonstrate the flexibility of our approach through extensions such as brush style conditioning and autoregressive sketch generation, enabling additional controllability and interactive, collaborative drawing.
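To make the data representation concrete, the following is a minimal sketch, not the authors' pipeline, of how a sketch can be encoded as a short video in which strokes appear progressively on a blank canvas, paired with a text instruction that states the stroke order. All names here (`line_points`, `render_sketch_video`, `ordering_prompt`), the binary canvas, the straight-line strokes, and the prompt wording are illustrative assumptions; the abstract does not specify any of them.

```python
from typing import List, Tuple

Point = Tuple[int, int]
Frame = List[List[int]]            # 0 = blank canvas, 1 = ink
Stroke = Tuple[str, List[Point]]   # (text label, ordered pixel positions)


def line_points(p0: Point, p1: Point, samples: int = 32) -> List[Point]:
    """Sample integer pixel positions along a straight stroke from p0 to p1."""
    (x0, y0), (x1, y1) = p0, p1
    return [
        (round(x0 + (x1 - x0) * t / (samples - 1)),
         round(y0 + (y1 - y0) * t / (samples - 1)))
        for t in range(samples)
    ]


def render_sketch_video(strokes: List[Stroke], size: int = 32,
                        frames_per_stroke: int = 4) -> List[Frame]:
    """Return frames in which each stroke is revealed gradually, in list order."""
    canvas = [[0] * size for _ in range(size)]
    frames = [[row[:] for row in canvas]]  # frame 0 is the blank canvas
    for _label, points in strokes:
        for k in range(1, frames_per_stroke + 1):
            # Reveal the first k/frames_per_stroke of this stroke's points.
            end = len(points) * k // frames_per_stroke
            for x, y in points[:end]:
                canvas[y][x] = 1
            frames.append([row[:] for row in canvas])
    return frames


def ordering_prompt(strokes: List[Stroke]) -> str:
    """Build a hypothetical text instruction that states the drawing order."""
    return "Draw in this order: " + ", then ".join(n for n, _ in strokes) + "."


# Hypothetical two-stroke composition: a horizontal line, then a vertical one.
strokes = [
    ("the horizontal line", line_points((4, 16), (27, 16))),
    ("the vertical line", line_points((16, 4), (16, 27))),
]
frames = render_sketch_video(strokes)
prompt = ordering_prompt(strokes)
```

Each (frames, prompt) pair produced this way has a known temporal structure, which is the property the abstract attributes to the synthetic shape compositions used for learning stroke ordering. Reordering the `strokes` list changes both the video and the prompt consistently.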
