The rapid progress of Natural Language Processing (NLP) technologies has made text generation tools such as ChatGPT and Claude widely available and highly effective. While highly useful, these technologies also pose significant risks to the credibility of various media forms when employed for paraphrased plagiarism -- one of the most subtle forms of content misuse in scientific literature and general text media. Although automated methods for paraphrase identification have been developed, detecting this type of plagiarism remains challenging because the datasets used to train these methods are inconsistent in how they represent paraphrases. In this article, we examine traditional and contemporary approaches to paraphrase identification, investigating how the under-representation of certain paraphrase types in popular datasets, including those used to train Large Language Models (LLMs), affects the ability to detect plagiarism. We introduce and validate a new refined typology for paraphrases (ReParaphrased, REfined PARAPHRASE typology definitions) to better understand the disparities in paraphrase type representation. Lastly, we propose new directions for future research and dataset development to enhance AI-based paraphrase detection.