A Survey on LLM-generated Text Detection: Necessity, Methods, and Future Directions

Large language models (LLMs) have developed a powerful ability to understand, follow, and generate complex language. As a result, LLM-generated text is flooding many areas of daily life at remarkable speed and is widely accepted by human readers. As LLMs continue to expand, there is an imperative need for detectors that can identify LLM-generated text. Such detectors are crucial for mitigating potential misuse of LLMs and for safeguarding realms such as artistic expression and social networks from the harmful influence of LLM-generated content. LLM-generated text detection aims to discern whether a piece of text was produced by an LLM, which is essentially a binary classification task. Detection techniques have advanced notably in recent years, propelled by innovations in watermarking techniques, zero-shot methods, methods that fine-tune LMs, adversarial learning methods, LLMs as detectors, and human-assisted methods. In this survey, we collate recent research breakthroughs in this area and underscore the pressing need to bolster detector research. We also examine prevalent datasets, elucidating their limitations and developmental requirements. Furthermore, we analyze various LLM-generated text detection paradigms, shedding light on challenges such as out-of-distribution problems, potential attacks, and data ambiguity. Finally, we highlight interesting directions for future research in LLM-generated text detection to advance the implementation of responsible artificial intelligence (AI). Our aim with this survey is to provide a clear and comprehensive introduction for newcomers while also offering seasoned researchers a valuable update on the field of LLM-generated text detection.
