Modern image generators produce strikingly realistic images, where only artifacts like distorted hands or warped objects reveal their synthetic origin. Detecting these artifacts is essential: without detection, we cannot benchmark generators or train reward models to improve them. Current detectors fine-tune VLMs on tens of thousands of labeled images, but this is expensive to repeat whenever generators evolve or new artifact types emerge. We show that pretrained VLMs already encode the knowledge needed to detect artifacts: with the right scaffolding, this capability can be unlocked using only a few hundred labeled examples per artifact category. Our system, ArtifactLens, achieves state-of-the-art performance on five human-artifact benchmarks (the first evaluation across multiple such datasets) while requiring orders of magnitude less labeled data. The scaffolding consists of a multi-component architecture with in-context learning and text instruction optimization, with novel improvements to each. Our methods generalize to other artifact types (object morphology, animal anatomy, and entity interactions) and to the distinct task of AIGC detection.