Large-scale Text-to-image Generation Models (LTGMs) such as DALL-E are self-supervised deep learning models trained on huge datasets, and they have demonstrated the capacity to generate high-quality open-domain images from multi-modal input. Although they can produce anthropomorphized versions of objects and animals, combine unrelated concepts in reasonable ways, and generate variations of any user-provided image, we observed that this rapid technological advancement has left many visual artists unsure how to leverage LTGMs more actively in their creative work. Our goal in this work is to understand how visual artists would adopt LTGMs to support their creative work. To this end, we conducted an interview study as well as a systematic literature review of 72 system/application papers. A total of 28 visual artists covering 35 distinct visual art domains acknowledged LTGMs' versatile roles and high usability in supporting creative work: automating the creation process (automation), expanding their ideas (exploration), and facilitating or arbitrating communication (mediation). We conclude by providing four design guidelines that future researchers can refer to when building intelligent user interfaces using LTGMs.