This paper presents a comprehensive overview of the first edition of the Academic Essay Authenticity Challenge, organized as part of the GenAI Content Detection shared tasks co-located with COLING 2025. The challenge focuses on distinguishing machine-generated from human-authored essays written for academic purposes. The task is defined as follows: "Given an essay, identify whether it is generated by a machine or authored by a human." The challenge covers two languages: English and Arabic. During the evaluation phase, 25 teams submitted systems for English and 21 teams for Arabic, reflecting substantial interest in the task. Finally, seven teams submitted system description papers. The majority of submissions utilized fine-tuned transformer-based models, with one team employing Large Language Models (LLMs) such as Llama 2 and Llama 3. This paper outlines the task formulation, details the dataset construction process, and explains the evaluation framework. Additionally, we present a summary of the approaches adopted by participating teams. Nearly all submitted systems outperformed the n-gram-based baseline, with the top-performing systems achieving F1 scores exceeding 0.98 for both languages, indicating significant progress in the detection of machine-generated text.
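For readers unfamiliar with the kind of n-gram baseline the systems are compared against, the following is a minimal illustrative sketch, not the challenge's official baseline: it assumes a scikit-learn environment and uses placeholder essays, labels, and hyperparameters purely to show how a character n-gram classifier and the F1 metric fit together.

```python
# Illustrative n-gram baseline for machine-generated essay detection.
# All data and hyperparameters below are placeholders, not the shared
# task's released data or official baseline configuration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Toy training data: 1 = machine-generated, 0 = human-authored (placeholders).
train_essays = [
    "In conclusion, the aforementioned factors collectively demonstrate ...",
    "i think school uniforms are unfair because ...",
]
train_labels = [1, 0]
dev_essays = ["Furthermore, it is evident that the proposed policy ..."]
dev_labels = [1]

# Character n-grams (1-3) are a common, largely language-agnostic choice,
# which would make the same recipe applicable to both English and Arabic.
baseline = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
baseline.fit(train_essays, train_labels)

# The challenge reports F1; here the positive class is "machine-generated".
preds = baseline.predict(dev_essays)
print("F1:", f1_score(dev_labels, preds))
```

Participating systems are then scored in the same way, by comparing their predicted labels against the gold labels with the F1 measure.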