Given the rapid ascent of large language models (LLMs), we study the question: (How) can large language models help in reviewing scientific papers or proposals? We first conduct pilot studies in which we find that (i) GPT-4 outperforms other LLMs (Bard, Vicuna, Koala, Alpaca, LLaMA, Dolly, OpenAssistant, StableLM), and (ii) prompting with a specific question (e.g., to identify errors) outperforms prompting to simply write a review. With these insights, we study the use of LLMs (specifically, GPT-4) for three tasks: (1) Identifying errors: We construct 13 short computer science papers, each with a deliberately inserted error, and ask the LLM to check the correctness of these papers. We observe that the LLM finds errors in 7 of them, spanning both mathematical and conceptual errors. (2) Verifying checklists: We task the LLM with verifying 16 closed-ended checklist questions in the respective sections of 15 NeurIPS 2022 papers. We find that across 119 {checklist question, paper} pairs, the LLM achieved 86.6% accuracy. (3) Choosing the "better" paper: We generate 10 pairs of abstracts, deliberately designing each pair so that one abstract is clearly superior to the other. The LLM, however, struggled to discern these relatively straightforward distinctions accurately, committing errors in its evaluations for 6 of the 10 pairs. Based on these experiments, we think that LLMs have a promising use as reviewing assistants for specific reviewing tasks, but not (yet) for complete evaluations of papers or proposals.
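To put the three reported results on a common scale, the following is a minimal sketch of the arithmetic, using only the counts stated in the abstract. The variable names and the `pct` helper are illustrative assumptions. The abstract does not give the number of correctly verified checklist pairs, so the sketch only infers which counts are consistent with the reported 86.6%.

```python
# Counts as reported in the abstract.
errors_found, papers_with_errors = 7, 13
checklist_pairs, checklist_accuracy_pct = 119, 86.6
pairs_wrong, abstract_pairs = 6, 10


def pct(numerator, denominator):
    """Return a percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


# Task 1: share of inserted errors that the LLM found.
error_detection_rate = pct(errors_found, papers_with_errors)

# Task 3: share of abstract pairs judged correctly (10 - 6 = 4 pairs).
pairwise_accuracy = pct(abstract_pairs - pairs_wrong, abstract_pairs)

# Task 2 (inferred, not reported): which numbers of correct answers
# out of 119 round to the stated 86.6% accuracy?
consistent_counts = [
    k for k in range(checklist_pairs + 1)
    if pct(k, checklist_pairs) == checklist_accuracy_pct
]

print(error_detection_rate, pairwise_accuracy, consistent_counts)
# prints: 53.8 40.0 [103]
```

Read this way, the LLM found roughly 54% of the inserted errors and chose the better abstract in only 40% of pairs, which is below the 50% expected from random guessing between two options. The checklist accuracy is consistent with 103 correct answers out of 119.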