We present an approach for estimating the fraction of text in a large corpus that is likely to have been substantially modified or produced by a large language model (LLM). Our maximum likelihood model leverages expert-written and AI-generated reference texts to accurately and efficiently examine real-world LLM use at the corpus level. We apply this approach to a case study of scientific peer review in AI conferences that took place after the release of ChatGPT: ICLR 2024, NeurIPS 2023, CoRL 2023, and EMNLP 2023. Our results suggest that between 6.5% and 16.9% of text submitted as peer reviews to these conferences could have been substantially modified by LLMs, i.e., beyond spell-checking or minor writing updates. The circumstances in which generated text occurs offer insight into user behavior: the estimated fraction of LLM-generated text is higher in reviews that report lower confidence, that were submitted close to the deadline, and that come from reviewers who are less likely to respond to author rebuttals. We also observe corpus-level trends in generated text that may be too subtle to detect at the individual level, and we discuss the implications of such trends for peer review. We call for future interdisciplinary work to examine how LLM use is changing our information and knowledge practices.
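The abstract does not spell out the form of the maximum likelihood model, so the following is only a minimal sketch of one way a corpus-level estimate of this kind could be set up. It assumes that each document is drawn from a two-component mixture, (1 - alpha) * P + alpha * Q, where P is fit on the expert-written reference texts and Q on the AI-generated reference texts, and that each distribution treats the presence of each vocabulary token in a document as an independent Bernoulli event. The mixture weight alpha is then estimated by expectation-maximization. All function names, the smoothing constant, and the toy data generator are hypothetical and are not taken from the paper.

```python
import math
import random
from collections import Counter


def fit_occurrence_probs(docs, vocab, smoothing=1.0):
    """Estimate P(token appears in a document) from a reference corpus.

    Additive smoothing keeps every probability strictly inside (0, 1).
    """
    vocab = set(vocab)
    counts = Counter()
    for doc in docs:
        counts.update(set(doc) & vocab)
    n = len(docs)
    return {tok: (counts[tok] + smoothing) / (n + 2.0 * smoothing) for tok in vocab}


def doc_log_likelihood(doc, probs):
    """Log-likelihood of one document under independent token-presence events."""
    present = set(doc)
    return sum(
        math.log(p) if tok in present else math.log(1.0 - p)
        for tok, p in probs.items()
    )


def estimate_alpha(log_p, log_q, iters=200):
    """Maximize sum_i log((1 - alpha) * P(x_i) + alpha * Q(x_i)) over alpha via EM.

    log_p[i] and log_q[i] are the log-likelihoods of document i under the
    human-reference model P and the AI-reference model Q.
    """
    alpha = 0.5
    for _ in range(iters):
        total = 0.0
        for lp, lq in zip(log_p, log_q):
            # E-step: posterior probability that document i came from Q,
            # computed with a max-shift for numerical stability.
            a = math.log(alpha) + lq
            b = math.log(1.0 - alpha) + lp
            m = max(a, b)
            wa, wb = math.exp(a - m), math.exp(b - m)
            total += wa / (wa + wb)
        # M-step: the new alpha is the mean posterior; clamp away from 0 and 1
        # so the logarithms above stay defined.
        alpha = min(max(total / len(log_p), 1e-9), 1.0 - 1e-9)
    return alpha


def sample_docs(n, probs, rng):
    """Toy generator (an assumption for illustration): sample token-presence docs."""
    return [[tok for tok, p in probs.items() if rng.random() < p] for _ in range(n)]
```

To try it, one would fit `fit_occurrence_probs` on each reference set, score every document in the target corpus with `doc_log_likelihood` under both models, and pass the two score lists to `estimate_alpha`. Because alpha is estimated from the corpus as a whole, no individual document has to be classified, which fits the abstract's point that corpus-level trends may be detectable even when individual cases are not.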