As large language models have demonstrated impressive performance in many domains, recent works have adopted language models (LMs) as controllers of visual modules for vision-and-language tasks. While existing work focuses on equipping LMs with visual understanding, we propose two novel interpretable/explainable visual programming frameworks for text-to-image (T2I) generation and evaluation. First, we introduce VPGen, an interpretable step-by-step T2I generation framework that decomposes T2I generation into three steps: object/count generation, layout generation, and image generation. We employ an LM to handle the first two steps by finetuning it on text-layout pairs. This step-by-step framework provides stronger spatial control than end-to-end models, the dominant approach for this task. Furthermore, we leverage the world knowledge of pretrained LMs, overcoming the limitation of previous layout-guided T2I works, which can only handle predefined object classes. We demonstrate that VPGen offers better control over the counts, spatial relations, and scales of objects than state-of-the-art T2I generation models. Second, we introduce VPEval, an interpretable and explainable evaluation framework for T2I generation based on visual programming. Unlike previous T2I evaluations that rely on a single scoring model, which is accurate in some skills but unreliable in others, VPEval produces evaluation programs that invoke a set of visual modules, each an expert in a different skill, and also provides visual and textual explanations of the evaluation results. Our analysis shows that VPEval correlates better with human judgments on both skill-specific and open-ended prompts than widely used single-model evaluation. We hope our work encourages future progress on interpretable/explainable generation and evaluation for T2I models. Website: https://vp-t2i.github.io
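To make the notion of an evaluation program concrete, the following is a minimal sketch of how a program might dispatch to skill-specific modules and return both a score and textual explanations. It is an illustration under stated assumptions, not the VPEval implementation: the `Detection` record, the module names `count` and `left_of`, the program format (a list of module name and argument pairs), and the averaging of pass/fail results are all hypothetical, and the detections are supplied by hand rather than produced by a real object detector.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Detection:
    """Hypothetical detector output: a label and an (x1, y1, x2, y2) box."""
    label: str
    box: Tuple[float, float, float, float]


Result = Tuple[bool, str]  # (passed, textual explanation)


def count_eval(dets: List[Detection], label: str, expected: int) -> Result:
    """Pass if exactly `expected` objects with `label` were detected."""
    found = sum(1 for d in dets if d.label == label)
    return found == expected, f"expected {expected} '{label}', detected {found}"


def left_of_eval(dets: List[Detection], a: str, b: str) -> Result:
    """Pass if some `a` box center lies to the left of some `b` box center."""
    def centers(label: str) -> List[float]:
        return [(d.box[0] + d.box[2]) / 2 for d in dets if d.label == label]

    xa, xb = centers(a), centers(b)
    ok = bool(xa) and bool(xb) and min(xa) < max(xb)
    return ok, f"'{a}' is {'' if ok else 'not '}left of '{b}'"


# Hypothetical registry mapping module names to skill-specific evaluators.
MODULES: Dict[str, Callable[..., Result]] = {
    "count": count_eval,
    "left_of": left_of_eval,
}


def run_program(
    program: Sequence[Tuple[str, tuple]], dets: List[Detection]
) -> Tuple[float, List[str]]:
    """Run each module call and return (fraction passed, explanations)."""
    if not program:
        raise ValueError("evaluation program is empty")
    results = [MODULES[name](dets, *args) for name, args in program]
    score = sum(ok for ok, _ in results) / len(results)
    return score, [msg for _, msg in results]


# Hand-written detections for a prompt such as "two dogs to the left of a cat".
detections = [
    Detection("dog", (0, 0, 10, 10)),
    Detection("dog", (20, 0, 30, 10)),
    Detection("cat", (50, 0, 60, 10)),
]
program = [("count", ("dog", 2)), ("left_of", ("dog", "cat"))]
score, explanations = run_program(program, detections)
print(score, explanations)
```

Keeping each skill in its own function means a failed check can be traced to a specific module and its explanation string, which is the property that distinguishes this style of evaluation from a single opaque score.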