Text plagiarism detection is a common natural language processing task that aims to determine whether a given text contains material plagiarized or copied from other texts. In existing research, detecting high-level plagiarism remains a challenge because high-quality datasets are lacking. In this paper, we propose a GPT-3.5-based method for generating plagiarized text, which yields a plagiarism detection dataset of 32,927 text pairs covering a wide range of plagiarism methods and helps fill this gap. We also propose an efficient and accurate plagiarism identification method based on Faiss and BERT. Our experiments show that this model outperforms other models on several metrics, achieving an Accuracy, Precision, Recall, and F1 Score of 98.86%, 98.90%, 98.86%, and 0.9888, respectively. Finally, we provide a user-friendly demo platform that allows users to upload a text library and intuitively participate in the plagiarism analysis.
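To make the retrieval idea concrete, the following is a minimal sketch of embedding-based nearest-neighbour matching with a similarity threshold, which is one plausible reading of a "Faiss with BERT" pipeline. It is not the authors' implementation. Everything in it is an assumption made for illustration: the `embed` function is a toy bag-of-words stand-in for a BERT sentence embedding, `BruteForceIndex` is an exhaustive inner-product search standing in for a Faiss index, and the names `is_plagiarised` and `threshold=0.8` are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a BERT sentence embedding (assumption):
    L2-normalised token counts stored as a sparse dict."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def inner_product(a, b):
    """Inner product of two sparse vectors.
    For unit-length vectors this equals cosine similarity."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


class BruteForceIndex:
    """Exhaustive inner-product search, standing in for a Faiss index
    (assumption; a real index would hold dense vectors)."""

    def __init__(self):
        self.texts = []
        self.vectors = []

    def add(self, texts):
        # Embed each library text once, at indexing time.
        for t in texts:
            self.texts.append(t)
            self.vectors.append(embed(t))

    def search(self, query, k=1):
        # Score the query against every stored vector, best first.
        q = embed(query)
        scored = [(inner_product(q, v), t)
                  for v, t in zip(self.vectors, self.texts)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


def is_plagiarised(query, index, threshold=0.8):
    """Flag the query if its nearest library text is similar enough.
    The threshold value is hypothetical, not taken from the paper."""
    hits = index.search(query, k=1)
    return bool(hits) and hits[0][0] >= threshold


library = [
    "the quick brown fox jumps over the lazy dog",
    "faiss builds indexes for dense vector search",
]
index = BruteForceIndex()
index.add(library)

near_copy = "the quick brown fox leaps over the lazy dog"
unrelated = "completely unrelated sentence about cooking pasta"
print(is_plagiarised(near_copy, index), is_plagiarised(unrelated, index))
```

The toy embedding only catches near-verbatim copies, since it compares surface tokens. Detecting paraphrased or otherwise disguised plagiarism requires embeddings that capture meaning, which is the role BERT plays in the proposed method, while an approximate or optimized index such as Faiss keeps the search fast on a large text library.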