Large Language Models (LLMs) have great potential to accelerate and support scholarly peer review and are increasingly used as fully automatic review generators (ARGs). However, potential biases and systematic errors may pose significant risks to scientific integrity; understanding the specific capabilities and limitations of state-of-the-art ARGs is essential. We focus on a core reviewing skill that underpins high-quality peer review: detecting faulty research logic. This involves evaluating the internal consistency between a paper's results, interpretations, and claims. We present a fully automated counterfactual evaluation framework that isolates and tests this skill under controlled conditions. Testing a range of ARG approaches, we find that, contrary to expectation, flaws in research logic have no significant effect on their output reviews. Based on our findings, we derive three actionable recommendations for future work and release our counterfactual dataset and evaluation framework publicly.