Large Language Models for Departmental Expert Review Quality Scores

Peer reviewers and Large Language Models (LLMs) presumably do very different things when asked to assess research. Nevertheless, recent evidence shows that LLMs have a moderate ability to predict quality scores for published academic journal articles. One untested potential application of LLMs is internal departmental review, which may be used to support appointment and promotion decisions or to select outputs for national assessments. This study assesses for the first time the extent to which (1) LLM quality scores align with internal departmental quality ratings and (2) LLM reports differ from expert reports. The analysis uses a private dataset of 58 published journal articles from the School of Information at the University of Sheffield, together with internal departmental quality ratings and reports. Scores from ChatGPT-4o, ChatGPT-4o mini, and Gemini 2.0 Flash correlate positively and moderately with the internal departmental ratings, whether the input is just the title and abstract or the full text. Whilst departmental reviews tended to be more specific and to show field-level knowledge, ChatGPT reports tended to be standardised, more general, and repetitive, and to include unsolicited suggestions for improvement. The results therefore (a) confirm the ability of LLMs to guess the quality scores of published academic research moderately well, (b) confirm that this ability is a guess rather than an evaluation (because it can be made from the title and abstract alone), (c) extend this ability to internal departmental expert review, and (d) show that LLM reports are less insightful than human expert reports for published academic journal articles.
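As an illustration of what "correlate positively and moderately" means in practice, the sketch below computes a rank correlation between departmental ratings and LLM scores. It rests on assumptions: the abstract does not say which correlation statistic was used, so Spearman's rho is chosen here only because quality ratings are ordinal. The function names and all numbers are hypothetical and are not data from the study.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy values for illustration only (NOT from the study).
departmental_ratings = [3, 4, 2, 3, 4, 1, 3, 2]
llm_scores = [3.2, 3.6, 2.9, 2.7, 3.9, 2.4, 3.4, 3.0]

rho = spearman(departmental_ratings, llm_scores)
print(f"Spearman rho = {rho:.2f}")
```

With real data, the same calculation could be run once for the title and abstract input and once for the full text input, to compare how strongly each set of LLM scores tracks the departmental ratings.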
