Recent revolutionary advances in generative AI enable large language models (LLMs) to generate realistic and coherent text. Despite the many existing metrics for evaluating the quality of generated text, there is still no rigorous assessment of how well LLMs perform on complex and demanding writing tasks. This study examines essays generated by ten leading LLMs for the analytical writing assessment of the Graduate Record Examination (GRE). We scored these essays using both human raters and the e-rater automated scoring engine used in the GRE scoring pipeline. Notably, the top-performing models, Gemini and GPT-4o, received average scores of 4.78 and 4.67, respectively, falling between "a generally thoughtful, well-developed analysis of the issue and conveys meaning clearly" and "presents a competent analysis of the issue and conveys meaning with acceptable clarity" according to the GRE scoring guidelines. We also evaluated how accurately these essays could be detected, using detectors trained on essays generated by the same and by different LLMs.