Estimating the quality of published research is important for evaluating departments, researchers, and job candidates. Citation-based indicators sometimes support these tasks, but they do not work for new articles and have low to moderate accuracy. Previous research has shown that ChatGPT can estimate the quality of research articles, with its scores correlating positively with a proxy for expert scores in all fields, and often more strongly than citation-based indicators, except in clinical medicine. ChatGPT scores may therefore replace citation-based indicators for some applications. This article investigates the clinical medicine anomaly with the largest dataset yet and a more detailed analysis. The results showed that ChatGPT 4o-mini scores for articles submitted to the UK's Research Excellence Framework (REF) 2021 Unit of Assessment (UoA) 1 Clinical Medicine correlated positively (r=0.134, n=9872) with departmental mean REF scores, against a theoretical maximum correlation of r=0.226. ChatGPT 4o and 3.5 turbo also gave positive correlations. At the departmental level, mean ChatGPT scores correlated more strongly with departmental mean REF scores (r=0.395, n=31). For the 100 journals with the most articles in UoA 1, mean ChatGPT scores correlated strongly with mean REF scores (r=0.495) but negatively with citation rates (r=-0.148). Journal and departmental anomalies in these results suggest that ChatGPT is ineffective at assessing the quality of research in prestigious medical journals, of research directly affecting human health, or both. Nevertheless, the results give evidence of ChatGPT's ability to assess research quality overall for Clinical Medicine, where it might replace citation-based indicators for new research.
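The correlations reported above are Pearson coefficients between two sets of scores (e.g. article-level ChatGPT scores against departmental mean REF scores). As a minimal sketch with hypothetical numbers (the scores below are invented for illustration, not taken from the study's data), such a coefficient can be computed as follows:

```python
# Minimal sketch: Pearson correlation between two hypothetical score sets,
# illustrating the kind of statistic reported in the abstract (e.g. r=0.134).
import numpy as np

# Hypothetical ChatGPT quality scores and REF score proxies for 5 articles.
chatgpt_scores = np.array([2.5, 3.0, 3.5, 2.0, 4.0])
ref_scores = np.array([2.0, 3.5, 3.0, 2.5, 4.0])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# element is the Pearson r between the two variables.
r = np.corrcoef(chatgpt_scores, ref_scores)[0, 1]
print(round(r, 3))  # → 0.8
```

A positive r here, as in the study, indicates that higher ChatGPT scores tend to accompany higher REF-based quality scores, though the article-level values reported (r=0.134) are far weaker than this toy example.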