Plagiarism is a pressing concern, even more so with the availability of large language models. Existing plagiarism detection systems reliably find copied and moderately reworded text but fail to detect idea plagiarism, especially in the mathematical sciences, which rely heavily on formal mathematical notation. We make two contributions. First, we establish a taxonomy of mathematical content reuse by annotating 122 potentially plagiarised scientific document pairs. Second, we analyze the best-performing approaches for detecting plagiarism and mathematical content similarity on the newly established taxonomy. We found that the best-performing methods for plagiarism detection and math content similarity achieved overall detection scores (PlagDet) of 0.06 and 0.16, respectively. These methods failed to detect most cases in all seven newly established math similarity types. The outlined contributions will benefit research in plagiarism detection systems, recommender systems, question-answering systems, and search engines. We make the code for our experiments and the annotated dataset available to the community: https://github.com/gipplab/Taxonomy-of-Mathematical-Plagiarism