We introduce AEGIS, A holistic benchmark for Evaluating forensic analysis of AI-Generated academic ImageS. Compared to existing benchmarks, AEGIS features three key advances: (1) Domain-Specific Complexity: it covers seven academic categories with 39 fine-grained subtypes and exposes intrinsic forensic difficulty, with even GPT-5.1 reaching only 48.80% overall performance and expert models achieving only limited localization accuracy (IoU 30.09%); (2) Diverse Forgery Simulations: it models four prevalent academic forgery strategies across 25 generative models, 11 of which yield average forensic accuracy below 50%, showing that forensics lags behind generative advances; and (3) Multi-Dimensional Forensic Evaluation: it jointly assesses detection, reasoning, and localization, revealing complementary strengths between model families, with multimodal large language models (MLLMs) reaching 84.74% accuracy in textual artifact recognition and expert detectors peaking at 79.54% accuracy in binary authenticity detection. By evaluating 25 leading MLLMs, nine expert models, and one unified multimodal understanding and generation model, AEGIS serves as a diagnostic testbed that exposes fundamental limitations in academic image forensics.