The quality of text-to-image generation is continuously improving, yet the boundaries of its applicability remain unclear. In particular, prompt engineering (the refinement of the text input with the aim of achieving better results) does not yet appear to have been geared towards working with pre-existing texts. We investigate whether text-to-image generation and prompt engineering can be used to generate basic illustrations of popular fairytales. Using Midjourney v4, we engage in action research with a dual aim: to attempt to generate five believable illustrations for each of five popular fairytales, and to define a prompt engineering process that starts from a pre-existing text and arrives at an illustration of it. We arrive at a tentative four-stage process: i) initial prompt, ii) composition adjustment, iii) style refinement, and iv) variation selection. We also discuss three reasons why the generation model struggles with certain illustrations: difficulty with counts, bias from stereotypical configurations, and an inability to depict overly fantastic situations. Our findings are not limited to the specific generation model and are intended to be generalisable to future ones.