While large language models (LLMs) excel at many domain-specific tasks, their ability to deeply comprehend and reason about full-length academic papers remains underexplored. Existing benchmarks often fall short of capturing such depth, either due to surface-level question design or unreliable evaluation metrics. To address this gap, we introduce ELAIPBench, a benchmark curated by domain experts to evaluate LLMs' comprehension of artificial intelligence (AI) research papers. Developed through an incentive-driven, adversarial annotation process, ELAIPBench features 403 multiple-choice questions drawn from 137 papers. It spans three difficulty levels and emphasizes non-trivial reasoning rather than shallow retrieval. Our experiments show that the best-performing LLM achieves an accuracy of only 39.95%, far below human performance. Moreover, we observe that frontier LLMs equipped with a thinking mode or a retrieval-augmented generation (RAG) system fail to improve final results, and can even harm accuracy due to overthinking or noisy retrieval. These findings underscore the significant gap between current LLM capabilities and genuine comprehension of academic papers.
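To make the evaluation setup concrete, the sketch below shows one way a multiple-choice accuracy evaluation over paper-grounded questions could be run. The item fields, the prompt format, and the `ask_model` stub are hypothetical placeholders for illustration; they are not the ELAIPBench data format or the authors' evaluation harness.

```python
# Minimal sketch of a multiple-choice accuracy evaluation, assuming a dataset of
# (paper, question, options, gold answer) items and a text-in/text-out model call.
import re
from typing import Callable

# Hypothetical example item; a real run would iterate over the full benchmark.
ITEMS = [
    {
        "paper": "…full paper text would go here…",
        "question": "Which component does the ablation identify as most important?",
        "options": {"A": "The retriever", "B": "The reranker",
                    "C": "The decoder", "D": "The tokenizer"},
        "answer": "B",
    },
]

def build_prompt(item: dict) -> str:
    """Format the paper, question, and options into a single prompt string."""
    opts = "\n".join(f"{k}. {v}" for k, v in item["options"].items())
    return (
        f"Paper:\n{item['paper']}\n\n"
        f"Question: {item['question']}\n{opts}\n"
        "Answer with a single letter (A-D)."
    )

def parse_choice(text: str) -> str | None:
    """Extract the first standalone option letter from the model's reply."""
    m = re.search(r"\b([ABCD])\b", text)
    return m.group(1) if m else None

def evaluate(ask_model: Callable[[str], str]) -> float:
    """Return accuracy of ask_model over ITEMS."""
    correct = 0
    for item in ITEMS:
        reply = ask_model(build_prompt(item))
        correct += parse_choice(reply) == item["answer"]
    return correct / len(ITEMS)

if __name__ == "__main__":
    # Stand-in model that always answers "A"; replace with a real LLM call.
    print(f"accuracy = {evaluate(lambda prompt: 'A'):.2%}")
```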