This article gives an overview of the detection of LLM-generated text. It presents different detection approaches and implemented detectors. Beyond discussing the implementations, the article focuses on benchmarking the detectors. Although numerous software products exist for detecting LLM-generated text, with a focus on ChatGPT-like LLMs, their detection quality (recognition rate) is unclear. Furthermore, while scientific contributions presenting novel approaches do attempt some comparison with other approaches, the construction and independence of their evaluation datasets are often not transparent. As a result, reported detector performance often differs between studies because different benchmarking datasets are used. This article describes the creation of an evaluation dataset and uses it to benchmark the selected detectors against each other.