Sketching serves as a versatile tool for externalizing ideas, enabling rapid exploration and visual communication that spans various disciplines. While artificial systems have driven substantial advances in content creation and human-computer interaction, capturing the dynamic and abstract nature of human sketching remains challenging. In this work, we introduce SketchAgent, a language-driven, sequential sketch generation method that enables users to create, modify, and refine sketches through dynamic, conversational interactions. Our approach requires no training or fine-tuning. Instead, we leverage the sequential nature and rich prior knowledge of off-the-shelf multimodal large language models (LLMs). We present an intuitive sketching language, introduced to the model through in-context examples, enabling it to "draw" using string-based actions. These are processed into vector graphics and then rendered to create a sketch on a pixel canvas, which can be accessed again for further tasks. By drawing stroke by stroke, our agent captures the evolving, dynamic qualities intrinsic to sketching. We demonstrate that SketchAgent can generate sketches from diverse prompts, engage in dialogue-driven drawing, and collaborate meaningfully with human users.
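The pipeline described above (string-based stroke actions parsed into vector graphics, then rendered stroke by stroke onto a canvas) can be illustrated with a minimal sketch. The action format used here (`"x1,y1 x2,y2 ..."`) and both function names are hypothetical assumptions for illustration only, not the paper's actual sketching language:

```python
def parse_stroke(action: str) -> list[tuple[float, float]]:
    """Parse one string-based stroke action into a list of vector points.

    Assumed format: whitespace-separated "x,y" pairs, e.g. "10,10 50,50".
    """
    return [tuple(map(float, pt.split(","))) for pt in action.split()]


def strokes_to_svg(actions: list[str], size: int = 100) -> str:
    """Convert a sequence of stroke actions into minimal SVG vector graphics.

    Each action becomes one polyline, so the sketch grows stroke by stroke;
    the resulting SVG could then be rasterized onto a pixel canvas.
    """
    polylines = []
    for action in actions:
        pts = " ".join(f"{x},{y}" for x, y in parse_stroke(action))
        polylines.append(
            f'<polyline points="{pts}" fill="none" stroke="black"/>'
        )
    body = "".join(polylines)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{size}" height="{size}">{body}</svg>'
    )


# Two string actions -> two vector strokes in one SVG document.
svg = strokes_to_svg(["10,10 50,50 90,10", "10,90 90,90"])
```

In the actual system, a multimodal LLM emits such string actions conversationally; rendering them to pixels lets the model re-inspect the canvas for later edits.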