Large language models (LLMs) can generate texts that pose risks of misuse, such as plagiarism, planting fake reviews on e-commerce platforms, or creating inflammatory false tweets. Consequently, detecting whether a text is generated by an LLM has become increasingly important. Existing high-quality detection methods usually require access to the interior of the model to extract its intrinsic characteristics. However, since the interior of a black-box model is inaccessible, detectors must resort to surrogate models, which degrades detection quality. To achieve high-quality detection of black-box models, we aim to extract the deep intrinsic characteristics of texts generated by the black-box model. We view the generation process as a coupling of the prompt and the intrinsic characteristics of the generative model. Based on this insight, we propose DPIC, a method that decouples the prompt from the intrinsic characteristics for LLM-generated text detection. Specifically, given a candidate text, DPIC employs an auxiliary LLM to reconstruct the prompt corresponding to the candidate text, and then uses that prompt to regenerate text with the auxiliary LLM, so that the candidate text and the regenerated text each align with their respective prompts. The similarity between the candidate text and the regenerated text is then used as the detection feature, eliminating the influence of the prompt during detection and allowing the detector to focus on the intrinsic characteristics of the generative model. Compared to the baselines, DPIC achieves average improvements of 6.76\% and 2.91\% in detecting texts from different domains generated by GPT4 and Claude3, respectively.
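The reconstruct-regenerate-compare pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `reconstruct_prompt` and `regenerate_text` are hypothetical stand-ins for calls to the auxiliary LLM, and a simple sequence-matching ratio replaces the learned similarity feature used by the actual detector.

```python
from difflib import SequenceMatcher


def reconstruct_prompt(candidate_text: str) -> str:
    # Hypothetical stand-in: a real system would query the auxiliary LLM
    # to infer the prompt that could have produced the candidate text.
    return f"Write a passage that begins: {candidate_text[:50]}"


def regenerate_text(prompt: str) -> str:
    # Hypothetical stand-in: a real system would have the auxiliary LLM
    # generate a fresh text from the reconstructed prompt.
    return f"Generated response for: {prompt}"


def dpic_feature(candidate_text: str) -> float:
    # Decoupling step: reconstruct the prompt, regenerate a text from it,
    # then score similarity between candidate and regenerated texts.
    # Because both texts align with the same prompt, the similarity
    # reflects the generative model's intrinsic characteristics.
    prompt = reconstruct_prompt(candidate_text)
    regenerated = regenerate_text(prompt)
    return SequenceMatcher(None, candidate_text, regenerated).ratio()


def detect(candidate_text: str, threshold: float = 0.5) -> bool:
    # Placeholder decision rule: the actual method trains a classifier on
    # the similarity feature rather than applying a fixed threshold.
    return dpic_feature(candidate_text) >= threshold
```

In the full method, both LLM calls go to the same auxiliary model, so the regenerated text carries that model's intrinsic characteristics; a machine-generated candidate tends to score higher similarity than a human-written one.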