Text plagiarism detection is a common natural language processing task that aims to determine whether a given text contains material plagiarized or copied from other texts. In existing research, detecting high-level plagiarism remains a challenge because high-quality datasets are lacking. In this paper, we propose a GPT-3.5-based method for generating plagiarized text, which produces a plagiarism detection dataset of 32,927 text pairs covering a wide range of plagiarism methods and helps fill this gap. We also propose a plagiarism identification method based on Faiss and BERT that is both efficient and accurate. Our experiments show that this model outperforms other models on several metrics, achieving 98.86%, 98.90%, 98.86%, and 0.9888 for Accuracy, Precision, Recall, and F1 Score, respectively. Finally, we provide a user-friendly demo platform that allows users to upload a text library and take part in the plagiarism analysis intuitively.
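For readers who want to relate the four reported metrics to one another, the following is a minimal sketch of the standard binary-classification definitions. It is not the evaluation code used in this work. The function names `metrics_from_counts` and `f1_from_precision_recall` and the confusion-matrix counts in the usage note are hypothetical, and the abstract does not state which averaging scheme (binary, macro, or weighted) was used, so the harmonic-mean relation is only an approximate consistency check.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def metrics_from_counts(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives,
    and true negatives for the "plagiarized" class.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1_from_precision_recall(precision, recall),
    }


if __name__ == "__main__":
    # Precision and recall as reported in the abstract, expressed as fractions.
    print(round(f1_from_precision_recall(0.9890, 0.9886), 4))
    # Hypothetical counts, for illustration only.
    print(metrics_from_counts(tp=8, fp=2, fn=2, tn=8))
```

Under these assumptions, the harmonic mean of the reported Precision (98.90%) and Recall (98.86%) rounds to 0.9888, which agrees with the reported F1 Score.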