Plagiarism is a pressing concern, even more so with the availability of large language models. Existing plagiarism detection systems reliably find copied and moderately reworded text but fail to detect idea plagiarism, especially in the mathematical sciences, which rely heavily on formal mathematical notation. We make two contributions. First, we establish a taxonomy of mathematical content reuse by annotating 122 potentially plagiarised scientific document pairs. Second, we analyze the best-performing approaches to detecting plagiarism and mathematical content similarity on the newly established taxonomy. We find that the best-performing methods for plagiarism and math content similarity achieve overall detection scores (PlagDet) of 0.06 and 0.16, respectively. These methods fail to detect most cases across all seven newly established math similarity types. The outlined contributions will benefit research on plagiarism detection systems, recommender systems, question-answering systems, and search engines. We make our experiment code and annotated dataset available to the community: https://github.com/gipplab/Taxonomy-of-Mathematical-Plagiarism