This paper introduces the first systematic evaluation framework for quantifying the quality and risks of papers written by modern coding agents. Although AI-driven paper writing is a growing concern, rigorous evaluation of the quality and potential risks of AI-written papers remains limited, and a unified understanding of their reliability is still lacking. We propose Paper Reconstruction Evaluation (PaperRecon), a framework in which an overview (overview.md) is created from an existing paper, an agent then generates a full paper from that overview and minimal additional resources, and the result is compared against the original paper. PaperRecon separates the evaluation of AI-written papers into two orthogonal dimensions, Presentation and Hallucination: Presentation is scored with a rubric, and Hallucination is assessed through agentic evaluation grounded in the original paper source. For evaluation, we build PaperWrite-Bench, a benchmark of 51 papers from top-tier venues across diverse domains published after 2025. Our experiments reveal a clear trade-off: both ClaudeCode and Codex improve with model advances, but ClaudeCode achieves higher presentation quality at the cost of more than 10 hallucinations per paper on average, whereas Codex produces fewer hallucinations but lower presentation quality. This work takes a first step toward establishing evaluation frameworks for AI-driven paper writing and improving the research community's understanding of its risks.
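The two-dimensional reporting described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: the `PaperResult` schema, the `summarize` helper, and in particular the choice to express Presentation as the fraction of rubric items satisfied. The abstract states only that Presentation is scored with a rubric and that hallucinations are counted per paper.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class PaperResult:
    """Scores for one reconstructed paper (hypothetical schema)."""
    paper_id: str
    rubric_hits: int     # rubric items the generated paper satisfies
    rubric_total: int    # rubric items checked for this paper
    hallucinations: int  # claims judged unsupported by the original paper

    def __post_init__(self) -> None:
        # Reject inputs that would make the presentation ratio meaningless.
        if self.rubric_total <= 0:
            raise ValueError("rubric_total must be positive")
        if not 0 <= self.rubric_hits <= self.rubric_total:
            raise ValueError("rubric_hits must lie in [0, rubric_total]")
        if self.hallucinations < 0:
            raise ValueError("hallucinations must be non-negative")

    @property
    def presentation(self) -> float:
        # Assumed scoring rule: share of rubric items satisfied.
        return self.rubric_hits / self.rubric_total


def summarize(results: list[PaperResult]) -> dict[str, float]:
    """Average each dimension separately; they are never merged into one score."""
    if not results:
        raise ValueError("no results to summarize")
    return {
        "presentation": mean(r.presentation for r in results),
        "hallucinations_per_paper": mean(r.hallucinations for r in results),
    }
```

Keeping the two averages in separate fields mirrors the abstract's point that the dimensions are orthogonal: an agent can score well on one and poorly on the other, and a single combined number would hide that trade-off.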