We present an approach for estimating the fraction of text in a large corpus that is likely to have been substantially modified or produced by a large language model (LLM). Our maximum likelihood model leverages expert-written and AI-generated reference texts to examine real-world LLM use at the corpus level, both accurately and efficiently. We apply this approach to a case study of scientific peer review in AI conferences held after the release of ChatGPT: ICLR 2024, NeurIPS 2023, CoRL 2023, and EMNLP 2023. Our results suggest that between 6.5% and 16.9% of text submitted as peer reviews to these conferences could have been substantially modified by LLMs, i.e., beyond spell-checking or minor writing updates. The circumstances in which generated text occurs offer insight into user behavior: the estimated fraction of LLM-generated text is higher in reviews that report lower confidence, that were submitted close to the deadline, and that come from reviewers who are less likely to respond to author rebuttals. We also observe corpus-level trends in generated text that may be too subtle to detect at the individual level, and we discuss the implications of such trends for peer review. We call for future interdisciplinary work to examine how LLM use is changing our information and knowledge practices.
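The corpus-level estimate described above can be illustrated with a minimal mixture-model sketch: treat each document's text as drawn from a mixture of an AI reference distribution and a human reference distribution, then choose the mixture weight α that maximizes the corpus likelihood. This is a simplified illustration under our own assumptions (per-document log-likelihoods under hypothetical AI and human reference models; the function name `estimate_llm_fraction` is ours), not the paper's exact implementation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def estimate_llm_fraction(log_p_ai, log_p_human):
    """Estimate the mixture weight alpha (fraction of AI-modified text)
    by maximizing sum_i log(alpha * p_ai_i + (1 - alpha) * p_human_i).

    log_p_ai, log_p_human: per-document log-likelihoods under the
    (assumed) AI-generated and human-written reference models.
    """
    log_p_ai = np.asarray(log_p_ai, dtype=float)
    log_p_human = np.asarray(log_p_human, dtype=float)

    def neg_log_likelihood(alpha):
        # logaddexp computes log(e^a + e^b) stably, avoiding underflow
        # when individual document likelihoods are tiny.
        return -np.sum(np.logaddexp(np.log(alpha) + log_p_ai,
                                    np.log1p(-alpha) + log_p_human))

    # Keep alpha strictly inside (0, 1) so the logs are finite.
    result = minimize_scalar(neg_log_likelihood,
                             bounds=(1e-6, 1 - 1e-6),
                             method="bounded")
    return result.x
```

On synthetic data where the two reference models are known (e.g., documents scored under two Gaussian likelihoods), the recovered α concentrates around the true mixing fraction as the corpus grows, which is the sense in which a corpus-level estimate can be reliable even when individual documents cannot be classified confidently.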