The availability of high-quality APIs for Large Language Models (LLMs) has facilitated the widespread creation of Machine-Generated Content (MGC), posing challenges such as academic plagiarism and the spread of misinformation. Existing MGC detectors often focus solely on surface-level information, overlooking implicit and structural features. This makes them susceptible to deception by surface-level sentence patterns, particularly in longer texts and in texts that have been subsequently paraphrased. To overcome these challenges, we introduce novel methodologies and datasets. Besides the publicly available dataset Plagbench, we developed the paraphrased Long-Form Question and Answer (paraLFQA) and paraphrased Writing Prompts (paraWP) datasets by extending their original versions with GPT and DIPPER, a discourse paraphrasing tool. To address the challenge of detecting highly similar paraphrased texts, we propose MhBART, an encoder-decoder model designed to emulate human writing style while incorporating a novel difference score mechanism; it outperforms strong classifier baselines and identifies deceptive sentence patterns. To better capture the structure of longer texts at the document level, we propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features. This yields substantial performance gains over SOTA approaches: 15.5\% absolute improvement on paraLFQA, 4\% on paraWP, and 1.5\% on M4.