W.J.T. Mitchell's influential essay 'What do pictures want?' shifts the theoretical focus away from the interpretative act of understanding pictures, and from the motivations of the humans who create them, to the possibility that the picture itself is an entity with agency and wants. In this article, I reframe Mitchell's question in light of contemporary AI image generation tools to ask: what do AI-generated images want? Drawing on art historical discourse on the nature of abstraction, I argue that AI-generated images want specificity and concreteness because they are fundamentally abstract. Multimodal text-to-image models, which are the primary subject of this article, rest on the premise that text and image are exchangeable tokens and that the two are commensurable, at least as represented mathematically in data. The user pipeline that sees textual input become visual output, however, obscures this representational regress and makes it seem as if one form transforms into the other, as if by magic.