Large Language Models (LLMs) have the potential to support research evaluation and have a moderate ability to estimate the research quality of a journal article from its title and abstract. This paper assesses whether language-related factors unrelated to research quality influence ChatGPT's scores. Using a dataset of 99,277 journal articles submitted to the UK-wide Research Excellence Framework (REF) 2021 assessments, we calculated several readability indicators from abstracts and correlated them with ChatGPT scores and departmental REF scores. In many subject areas, linguistic complexity and length were more strongly associated with ChatGPT research quality scores than with REF expert scores. Although cause-and-effect was not tested, these results suggest that ChatGPT may be more likely than human experts to reward linguistic complexity, with a potential bias towards longer and less readable abstracts in many fields. This apparent preference of LLMs for complex language is undesirable for practical applications of LLMs to research quality evaluation, unless solutions can be found.