The availability of high-quality APIs for Large Language Models (LLMs) has facilitated the widespread creation of Machine-Generated Content (MGC), posing challenges such as academic plagiarism and the spread of misinformation. Existing MGC detectors often focus solely on surface-level information, overlooking implicit and structural features. This makes them susceptible to deception by surface-level sentence patterns, particularly in longer texts and in texts that have been subsequently paraphrased. To overcome these challenges, we introduce novel methodologies and datasets. Besides the publicly available dataset Plagbench, we developed the paraphrased Long-Form Question and Answer (paraLFQA) and paraphrased Writing Prompts (paraWP) datasets by extending their original versions with GPT and DIPPER, a discourse paraphrasing tool. To address the challenge of detecting highly similar paraphrased texts, we propose MhBART, an encoder-decoder model designed to emulate human writing style while incorporating a novel difference score mechanism; it outperforms strong classifier baselines and identifies deceptive sentence patterns. To better capture the structure of longer texts at the document level, we propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features. This yields substantial performance gains over SOTA approaches: 15.5\% absolute improvement on paraLFQA, 4\% on paraWP, and 1.5\% on M4.