As the digital imagery landscape evolves rapidly, stock image libraries and AI-generated image marketplaces have become central to visual media. Traditional stock images now exist alongside platforms that trade in prompts for AI-generated visuals, powered by generative APIs such as DALL-E 3 and Midjourney. This paper studies whether multi-modal models with enhanced visual understanding can be used to mimic the outputs of these platforms, and introduces a novel attack strategy. Our method combines fine-tuned CLIP models, a multi-label classifier, and the descriptive capabilities of GPT-4V to create prompts that generate images similar to those sold in prompt marketplaces and by premium stock image providers, at a markedly lower cost. In presenting this strategy, we aim to spotlight a new class of economic and security considerations in digital imagery. Our findings, supported by both automated metrics and human assessment, show that comparable visual content can be produced for $0.23 to $0.27 per image, a fraction of prevailing market prices. This underscores the need for awareness and strategic discussion about the integrity of digital media in an increasingly AI-integrated landscape. Our work also contributes a dataset of approximately 19 million prompt-image pairs generated by the popular Midjourney platform, which we plan to release publicly.