Revising scientific papers based on peer feedback is a challenging task that requires not only deep scientific knowledge and reasoning, but also the ability to recognize the implicit requests in high-level feedback and to choose the best of many possible ways to update the manuscript in response. We introduce this task for large language models and release ARIES, a dataset of review comments and their corresponding paper edits, to enable training and evaluating models. We study two versions of the task: comment-edit alignment and edit generation, and evaluate several baselines, including GPT-4. We find that models struggle even to identify the edits that correspond to a comment, especially in cases where the comment is phrased in an indirect way or where the edit addresses the spirit of a comment but not the precise request. When tasked with generating edits, GPT-4 often succeeds in addressing comments on a surface level, but it rigidly follows the wording of the feedback rather than the underlying intent, and includes fewer technical details than human-written edits. We hope that our formalization, dataset, and analysis will form a foundation for future work in this area.