In which fields do ChatGPT scores align better than citations with research quality?

Although citation-based indicators are widely used for research evaluation, they are not useful for recently published research, reflect only one of the three common dimensions of research quality, and have little value in some social sciences, arts and humanities. Large Language Models (LLMs) have been shown to address some of these weaknesses, with ChatGPT-4o mini showing the most promising results, although on incomplete data. This article reports by far the largest-scale evaluation of ChatGPT-4o mini yet and also evaluates its larger sibling, ChatGPT-4o, as well as ChatGPT-5 mini. Based on comparisons between LLM scores, averaged over 5 repetitions, and departmental average quality scores for 107,212 UK-based refereed journal articles, ChatGPT-4o is marginally better than ChatGPT-4o mini in most of the 34 field-based Units of Assessment (UoAs) tested, although combining both gives better results than either one alone. ChatGPT-4o scores have a positive correlation with research quality in 33 of the 34 UoAs, with the results being statistically significant in 31. The most substantial exception is Physics, for which citations are more useful. ChatGPT-4o scores have a higher correlation with research quality than long-term citation rates in 21 of the 34 UoAs and a higher correlation than short-term citation rates in 26 of the 34 UoAs. ChatGPT-5 mini has even stronger correlations overall. In summary, the results give the first large-scale evidence that ChatGPT-4o and ChatGPT-5 mini are competitive with citations as sources of new research quality indicators.
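The comparison described above (LLM scores averaged over repetitions, then correlated with a quality proxy within each field) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the study's code: the abstract does not say which correlation coefficient was used, so Spearman rank correlation is assumed here, and the function names, the record layout, and the toy values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation (an assumed choice): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def correlations_by_field(articles):
    """articles: iterable of (field, repeated_llm_scores, quality_proxy) tuples."""
    by_field = defaultdict(lambda: ([], []))
    for field, repeats, quality in articles:
        llm_scores, quality_scores = by_field[field]
        llm_scores.append(mean(repeats))  # average the repeated LLM scores
        quality_scores.append(quality)    # e.g. a departmental average score
    return {f: spearman(x, y) for f, (x, y) in by_field.items()}


# Made-up illustrative records, not data from the study.
toy_articles = [
    ("Field A", [3, 3, 4, 3, 3], 2.9),
    ("Field A", [2, 2, 2, 3, 2], 2.5),
    ("Field A", [4, 4, 3, 4, 4], 3.4),
    ("Field B", [3, 4, 3, 3, 3], 2.2),
    ("Field B", [2, 2, 2, 2, 2], 3.0),
    ("Field B", [4, 4, 4, 4, 3], 2.6),
]
result = correlations_by_field(toy_articles)
print(result)
```

Averaging the repetitions before ranking means each article contributes a single, smoother score, and computing the correlation separately per field mirrors the per-UoA reporting in the abstract. The same function could be run on citation rates in place of the averaged LLM scores to compare the two indicators field by field.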
