Can large language models provide useful feedback on research papers? A large-scale empirical analysis

Expert feedback lays the foundation of rigorous research. However, the rapid growth of scholarly production and intricate knowledge specialization challenge the conventional scientific feedback mechanisms. High-quality peer reviews are increasingly difficult to obtain. Researchers who are more junior or from under-resourced settings have an especially hard time getting timely feedback. With the breakthrough of large language models (LLMs) such as GPT-4, there is growing interest in using LLMs to generate scientific feedback on research manuscripts. However, the utility of LLM-generated feedback has not been systematically studied. To address this gap, we created an automated pipeline using GPT-4 to provide comments on the full PDFs of scientific papers. We evaluated the quality of GPT-4's feedback through two large-scale studies. We first quantitatively compared GPT-4's generated feedback with human peer reviewer feedback in 15 Nature family journals (3,096 papers in total) and the ICLR machine learning conference (1,709 papers). The overlap in the points raised by GPT-4 and by human reviewers (average overlap 30.85% for Nature journals, 39.23% for ICLR) is comparable to the overlap between two human reviewers (average overlap 28.58% for Nature journals, 35.25% for ICLR). The overlap between GPT-4 and human reviewers is larger for the weaker papers. We then conducted a prospective user study with 308 researchers from 110 US institutions in the fields of AI and computational biology to understand how researchers perceive feedback generated by our GPT-4 system on their own papers. Overall, more than half (57.4%) of the users found GPT-4-generated feedback helpful or very helpful, and 82.4% found it more beneficial than feedback from at least some human reviewers. While our findings show that LLM-generated feedback can help researchers, we also identify several limitations.
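To make the overlap figures more concrete, the following is a minimal sketch of one way such a percentage could be computed. The abstract does not specify the exact procedure, so everything here is an assumption for illustration: the overlap is taken to be the fraction of points in one review that match at least one point in another review, and the names `overlap_rate` and `same_topic`, the topic labels, and the sample comments are all hypothetical.

```python
def overlap_rate(points_a, points_b, is_match):
    """Fraction of points in points_a that match at least one point in points_b.

    Assumption: this "hit rate" style definition is illustrative only; the
    abstract does not state how overlap was actually measured.
    """
    if not points_a:
        return 0.0
    hits = sum(1 for a in points_a if any(is_match(a, b) for b in points_b))
    return hits / len(points_a)


def same_topic(a, b):
    # Hypothetical matcher: two points "match" when they carry the same topic label.
    # A real pipeline would need some form of semantic matching instead.
    return a["topic"] == b["topic"]


# Made-up review points, purely for demonstration.
llm_points = [
    {"topic": "baselines", "text": "Compare against stronger baselines."},
    {"topic": "ablation", "text": "Add an ablation study."},
    {"topic": "clarity", "text": "Clarify the notation in Section 3."},
    {"topic": "ethics", "text": "Discuss ethical implications."},
]
human_points = [
    {"topic": "baselines", "text": "The baselines are weak."},
    {"topic": "datasets", "text": "Evaluate on more datasets."},
]

rate = overlap_rate(llm_points, human_points, same_topic)
print(f"{rate:.2%}")  # prints 25.00%
```

Note that under this assumed definition the measure is asymmetric: `overlap_rate(human_points, llm_points, same_topic)` gives 50% for the same data, because one of the two human points is matched.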
