Our work addresses the problem of distinguishing text generated by Large Language Models (LLMs) from human-written text, a task essential for numerous applications. Despite ongoing debate about whether such detection is feasible, we present evidence that it is consistently achievable, except when the human and machine text distributions are indistinguishable across their entire support. Drawing on information theory, we argue that as machine-generated text becomes more human-like, the sample size needed for detection increases. We establish precise sample complexity bounds for detecting AI-generated text, laying the groundwork for future research on more advanced, multi-sample detectors. Our empirical evaluations on multiple datasets (XSum, SQuAD, IMDb, and Kaggle FakeNews) confirm the viability of enhanced detection methods. We test several state-of-the-art text generators, including GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF, and Llama-2-70B-Chat-HF, against detectors including RoBERTa-Large/Base-Detector and GPTZero. Our findings align with OpenAI's empirical observations on sequence length, and our analysis provides the first theoretical substantiation of those observations.
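The claim that closer distributions demand more samples can be illustrated with a toy calculation. The sketch below is not the paper's bound or method: it assumes, purely for illustration, that each sample is a single Bernoulli draw, with "human" and "machine" sources differing only in their success probability. The function names and the 0.9 target are hypothetical choices. It computes the exact total variation (TV) distance between the two n-sample distributions and searches for the smallest n at which that distance reaches the target.

```python
from math import exp, lgamma, log


def binomial_pmf(n, p):
    """Binomial(n, p) probabilities for k = 0..n, computed in log space.

    Assumes 0 < p < 1 (toy setting only).
    """
    log_p, log_q = log(p), log(1.0 - p)
    return [
        exp(lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log_p + (n - k) * log_q)
        for k in range(n + 1)
    ]


def tv_distance_n_samples(p_human, p_machine, n):
    """Exact TV distance between the two n-sample product distributions.

    For i.i.d. Bernoulli draws the count of ones is a sufficient statistic,
    so the TV distance of the products equals that of the two binomials.
    """
    human = binomial_pmf(n, p_human)
    machine = binomial_pmf(n, p_machine)
    return 0.5 * sum(abs(h - m) for h, m in zip(human, machine))


def samples_needed(p_human, p_machine, target_tv=0.9, max_n=2000):
    """Smallest n whose n-sample TV distance reaches target_tv, else None."""
    for n in range(1, max_n + 1):
        if tv_distance_n_samples(p_human, p_machine, n) >= target_tv:
            return n
    return None


if __name__ == "__main__":
    # Hypothetical gaps between the human and machine sources.
    for gap in (0.3, 0.2, 0.1):
        print(f"gap={gap}: n={samples_needed(0.5, 0.5 + gap)}")
    # Identical sources are never separable, however many samples are drawn.
    print("gap=0.0:", samples_needed(0.5, 0.5, max_n=50))
```

In this toy model the required n grows as the gap shrinks, and no n suffices when the two sources coincide, which mirrors the qualitative statement in the abstract. Real text is neither binary nor i.i.d., so the numbers carry no meaning beyond the trend.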