The rapid progress of Natural Language Processing (NLP) technologies has made text generation tools such as ChatGPT and Claude widely available and highly effective. While highly useful, these technologies also pose significant risks to the credibility of various media forms when employed for paraphrased plagiarism -- one of the most subtle forms of content misuse in scientific literature and general text media. Although automated methods for paraphrase identification have been developed, detecting this type of plagiarism remains challenging because the datasets used to train these methods are inconsistent in how they represent paraphrases. In this article, we examine traditional and contemporary approaches to paraphrase identification, investigating how the under-representation of certain paraphrase types in popular datasets, including those used to train Large Language Models (LLMs), affects the ability to detect plagiarism. We introduce and validate a new refined typology for paraphrases (ReParaphrased, REfined PARAPHRASE typology definitions) to better understand the disparities in paraphrase type representation. Lastly, we propose new directions for future research and dataset development to enhance AI-based paraphrase detection.