Imitate Before Detect: Aligning Machine Stylistic Preference for Machine-Revised Text Detection

Jiaqi Chen, Xiaoye Zhu, Tianyang Liu, Ying Chen, Xinhui Chen, Yiwen Yuan, Chak Tou Leong, Zuchao Li, Tang Long, Lei Zhang, Chenyu Yan, Guanghao Mei, Jie Zhang, Lefei Zhang

From arXiv. To appear at AAAI 2025. 14 pages, 6 figures.

Large Language Models (LLMs) have revolutionized text generation, making machine-generated text increasingly difficult to detect. Although existing methods perform well on purely machine-generated text, they perform poorly at distinguishing machine-revised text (rewriting, expansion, and polishing), which may differ only slightly from the original human-written prompt. Because the content of such text may originate from human prompts, detecting machine-revised text often requires identifying distinctive machine styles, e.g., wording favored by LLMs. However, existing methods struggle to detect machine-style phrasing hidden within human-contributed content. We propose the "Imitate Before Detect" (ImBD) approach, which first imitates the machine-style token distribution and then compares the distribution of the text under test with that machine-style distribution to determine whether the text has been machine-revised. To this end, we introduce style preference optimization (SPO), which aligns a scoring LLM with the stylistic preferences of machine-generated text. The aligned scoring model is then used to compute the style-conditional probability curvature (Style-CPC), which quantifies the log-probability difference between the original and conditionally sampled texts for effective detection. We conduct extensive comparisons across various scenarios, encompassing text revisions by six LLMs, four distinct text domains, and three machine revision types. Compared with existing state-of-the-art methods, our method yields a 13% increase in AUC for detecting text revised by open-source LLMs and improves performance by 5% and 19% for detecting text revised by GPT-3.5 and GPT-4o, respectively. Notably, our method surpasses the commercially trained GPT-Zero with just 1,000 samples and five minutes of SPO, demonstrating its efficiency and effectiveness.
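
To make the curvature idea concrete, the following is a minimal sketch of a conditional probability curvature score. The abstract only states that Style-CPC quantifies the log-probability difference between the original and conditionally sampled texts; the analytic expectation and the standard-deviation normalization used here are assumptions, not details confirmed by the abstract. The function name `conditional_curvature` and the toy logits are hypothetical. In practice the logits would come from the SPO-aligned scoring model evaluated on the text under test.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def conditional_curvature(logits, token_ids):
    """Curvature-style score for one text (hypothetical helper).

    logits:    (T, V) next-token logits from the scoring model, each row
               conditioned on the original prefix of the text.
    token_ids: (T,) tokens actually observed in the text.

    Assumption: alternative tokens are sampled independently at each
    position from the scoring model's own distribution, so the mean and
    variance of the sampled log-probability can be computed in closed form.
    """
    logp = log_softmax(np.asarray(logits, dtype=np.float64))
    probs = np.exp(logp)
    token_ids = np.asarray(token_ids)

    # Log-probability of the original text under the scoring model.
    observed = logp[np.arange(len(token_ids)), token_ids].sum()

    # Expected log-probability and variance of conditionally sampled tokens.
    mean_per_pos = (probs * logp).sum(axis=-1)
    var_per_pos = (probs * logp**2).sum(axis=-1) - mean_per_pos**2

    # Guard against a zero variance (e.g. perfectly uniform distributions).
    std = np.sqrt(max(var_per_pos.sum(), 1e-12))
    return (observed - mean_per_pos.sum()) / std


# Toy example: 2 positions, vocabulary of 3 tokens (made-up numbers).
toy_logits = np.array([[2.0, 0.0, -1.0],
                       [0.5, 3.0, 0.0]])
score_likely = conditional_curvature(toy_logits, [0, 1])    # model-preferred tokens
score_unlikely = conditional_curvature(toy_logits, [2, 2])  # model-disfavored tokens
```

Under these assumptions, a text made of tokens the scoring model prefers receives a higher score than sampled alternatives would on average, and a text made of disfavored tokens receives a lower one. A detector would then threshold this score, with the threshold chosen on held-out data.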
