Large language models (LLMs) have made tremendous progress in natural language understanding and have also been successfully adopted in other domains such as computer vision, robotics, and reinforcement learning. In this work, we apply LLMs to image generation by directly generating the virtual brush strokes that paint an image. We present Painter, an LLM that converts textual user prompts into sketches by generating the corresponding brush strokes auto-regressively. We build Painter on an off-the-shelf LLM pre-trained on a large text corpus, fine-tuning it on the new task while preserving its language understanding capabilities. We create a dataset of diverse multi-object sketches paired with textual prompts, covering several object types and tasks. Painter can generate sketches from text descriptions, remove objects from the canvas, and detect and classify objects in sketches. Although this is pioneering work in using LLMs for auto-regressive image generation, the results are very encouraging.