This work addresses critical challenges to academic integrity, including plagiarism, fabrication, and the verification of authorship of educational content, by proposing a Natural Language Processing (NLP)-based framework for authenticating students' content through authorship attribution and style change detection. Despite some initial efforts, several aspects of the topic remain unexplored. In contrast to existing solutions, the paper analyzes the topic comprehensively by targeting four relevant tasks: (i) classifying human-written and machine-generated text, (ii) distinguishing single-authored from multi-authored documents, (iii) detecting author changes within multi-authored documents, and (iv) recognizing authors in collaboratively produced documents. The proposed solutions are evaluated on two datasets generated with Gemini using two different prompts: a normal and a strict set of instructions. In the experiments, the performance of the proposed solutions drops somewhat on the dataset generated with the strict prompt, demonstrating the difficulty of detecting machine-generated text produced with carefully crafted prompts. The generated datasets, code, and other relevant materials are publicly available on GitHub and are expected to provide a baseline for future research in the domain.
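To make task (iii) concrete, the following is a minimal sketch of author change detection between adjacent paragraphs. It is not the method proposed in the paper: it assumes a simple stylometric baseline (character trigram profiles compared by cosine similarity), and the function names `char_ngrams`, `cosine`, and `detect_style_changes`, as well as the default threshold of 0.3, are hypothetical choices made only for illustration.

```python
from __future__ import annotations

from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in lowercased, whitespace-normalized text."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_style_changes(paragraphs: list[str], threshold: float = 0.3) -> list[int]:
    """Return each index i where paragraphs i and i+1 look stylistically different.

    The threshold is an illustrative assumption, not a value taken from the paper;
    in practice it would have to be tuned on labeled data.
    """
    profiles = [char_ngrams(p) for p in paragraphs]
    return [
        i
        for i in range(len(profiles) - 1)
        if cosine(profiles[i], profiles[i + 1]) < threshold
    ]
```

Calling `detect_style_changes(paragraphs)` on a list of paragraph strings yields the positions of suspected author boundaries. A surface-level baseline of this kind is likely to be sensitive to topic shifts as well as author shifts, so it is best read as a reference point for comparison rather than a working detector.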