Large language models (LLMs) represent a promising but controversial tool for aiding scientific peer review. This study evaluates the usefulness of LLMs in a conference setting as a tool for vetting paper submissions against submission standards. We conducted an experiment at the 2024 Neural Information Processing Systems (NeurIPS) conference, in which 234 papers were voluntarily submitted to an "LLM-based Checklist Assistant." The assistant checks whether papers adhere to the NeurIPS author checklist, which contains questions designed to ensure compliance with research and manuscript-preparation standards. Evaluations by NeurIPS authors suggest that the assistant was generally helpful in verifying checklist completion. In post-usage surveys, over 70% of authors found the assistant useful, and 70% indicated that they would revise their papers or checklist responses based on its feedback. While causal attribution to the assistant is not definitive, qualitative evidence suggests that the LLM contributed to improving some submissions: survey responses and analysis of re-submissions indicate that authors made substantive revisions in response to specific feedback from the LLM. The experiment also highlights common failure modes of LLMs: inaccuracy (20 of 52 author-reported issues) and excessive strictness (14 of 52) were the most frequently flagged problems. Finally, we conducted experiments to probe potential gaming of the system, which revealed that the assistant could be manipulated into giving higher scores through fabricated justifications, highlighting a vulnerability of automated review tools.