Recent revolutionary advances in generative AI enable large language models (LLMs) to generate realistic and coherent text. Despite the many existing metrics for evaluating the quality of generated text, there is still a lack of rigorous assessment of how well LLMs perform on complex and demanding writing assessments. This study examines essays generated by ten leading LLMs for the analytical writing assessment of the Graduate Record Examination (GRE). We assessed these essays using both human raters and the e-rater automated scoring engine as used in the GRE scoring pipeline. Notably, the top-performing Gemini and GPT-4o received average scores of 4.78 and 4.67, respectively, falling between "generally thoughtful, well-developed analysis of the issue and conveys meaning clearly" and "presents a competent analysis of the issue and conveys meaning with acceptable clarity" according to the GRE scoring guidelines. We also evaluated how accurately these essays can be detected as machine-generated, using detectors trained on essays generated by the same and by different LLMs.