Graphics design is important for many applications, including movie production and game design. To create a high-quality scene, designers usually spend hours in software like Blender, where they may need to interleave and repeat operations, such as connecting material nodes, hundreds of times. Moreover, slightly different design goals may require completely different sequences of operations, which makes automation difficult. In this paper, we propose a system that leverages Vision-Language Models (VLMs), such as GPT-4V, to intelligently search the design action space for a result that satisfies the user's intent. Specifically, we design a vision-based edit generator and a state evaluator that work together to find the sequence of actions that achieves the goal. Inspired by the role of visual imagination in the human design process, we supplement the visual reasoning capabilities of VLMs with "imagined" reference images from image-generation models, which provide visual grounding for abstract language descriptions. We provide empirical evidence suggesting that our system can produce simple but tedious Blender editing sequences for tasks such as editing procedural materials from text and/or reference images, and adjusting lighting configurations for product renderings in complex scenes.
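The interplay between an edit generator and a state evaluator described in the abstract can be illustrated with a minimal sketch. This is not the system's implementation: the abstract does not specify the search procedure, so the loop below is an assumed greedy generate-and-compare scheme. The names `iterative_refine`, `toy_propose`, `toy_prefer`, and `TARGET` are hypothetical, and plain Python functions stand in for the VLM calls, the rendered images, and the Blender state.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a Blender program state (e.g. material parameters).
State = Dict[str, float]


def iterative_refine(
    initial: State,
    propose: Callable[[State], List[State]],
    prefer: Callable[[State, State], State],
    rounds: int = 5,
) -> State:
    """Greedy search: each round, propose edits to the current best state
    and keep whichever state the evaluator prefers."""
    best = initial
    for _ in range(rounds):
        for candidate in propose(best):
            best = prefer(best, candidate)
    return best


# Assumption: a numeric target replaces the user's text or image intent.
TARGET: State = {"roughness": 0.2, "metallic": 0.9}


def toy_distance(state: State) -> float:
    """Sum of absolute parameter differences to the toy target."""
    return sum(abs(state[k] - TARGET[k]) for k in TARGET)


def toy_propose(state: State) -> List[State]:
    """Stand-in edit generator: nudge each parameter up or down by 0.1."""
    candidates = []
    for key in state:
        for delta in (-0.1, 0.1):
            edited = dict(state)
            edited[key] = round(min(1.0, max(0.0, edited[key] + delta)), 3)
            candidates.append(edited)
    return candidates


def toy_prefer(a: State, b: State) -> State:
    """Stand-in state evaluator: pairwise comparison, keep the closer state."""
    return b if toy_distance(b) < toy_distance(a) else a


start = {"roughness": 0.8, "metallic": 0.1}
result = iterative_refine(start, toy_propose, toy_prefer, rounds=20)
print(result, toy_distance(result))
```

In a real setting, `propose` and `prefer` would presumably wrap VLM queries over rendered images. The pairwise form of `prefer` is a design choice made for this sketch, since comparing two candidates is often easier to specify than scoring one in isolation.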