Our study explores how well state-of-the-art Large Language Models (LLMs), such as GPT-4 and Mistral, can assess the quality of scientific summaries or, more fittingly, scientific syntheses, and compares their evaluations with those of human annotators. We used a dataset of 100 research questions, each paired with a synthesis that GPT-4 generated from the abstracts of five related papers, with human quality ratings serving as the reference. The study evaluates the ability of both the closed-source GPT-4 and the open-source Mistral model to rate these syntheses and to give reasons for their judgments. Preliminary results show that LLMs can offer logical explanations that somewhat match the quality ratings, yet a deeper statistical analysis shows only a weak correlation between LLM and human ratings, which points to both the potential and the current limitations of LLMs in scientific synthesis evaluation.
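To make the comparison between LLM and human ratings concrete, the following is a minimal sketch of how such an agreement check could be computed. The abstract does not say which statistic the study used; Spearman rank correlation is assumed here because quality ratings are ordinal. The function names and the example ratings are hypothetical and are not taken from the study's data.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when one rating list is constant")
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y):
        raise ValueError("rating lists must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical ratings for five syntheses (illustrative only, not study data).
    human_ratings = [4, 3, 5, 2, 4]
    llm_ratings = [5, 4, 4, 3, 4]
    print(f"Spearman rho: {spearman(human_ratings, llm_ratings):.3f}")
```

A value near 1 would indicate that the LLM orders the syntheses much as the human annotators do, while a value near 0 would correspond to the weak agreement the abstract reports. With many tied ratings, a tie-aware statistic such as Kendall's tau-b may be a reasonable alternative.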