We present STADEE, a \textbf{STA}tistics-based \textbf{DEE}p detection method for identifying machine-generated text. It addresses a limitation of current methods, which rely heavily on fine-tuning pre-trained language models (PLMs). STADEE feeds key statistical text features into a deep classifier, focusing on features such as token probability and cumulative probability, which are crucial for handling nucleus sampling. We evaluate STADEE on diverse datasets in three scenarios: in-domain, out-of-domain, and in-the-wild. It achieves an 87.05% F1 score in the in-domain setting and outperforms both traditional statistical methods and fine-tuned PLMs, especially in the out-of-domain and in-the-wild settings, which highlights its effectiveness and generalizability.
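To illustrate why cumulative probability is informative under nucleus (top-p) sampling, the following is a minimal sketch and not the STADEE implementation. It assumes a toy next-token distribution given as a dictionary, and it assumes one plausible definition of the feature: the total probability mass of all tokens ranked at or above the observed token. The names `token_features` and `could_be_sampled` and the example distribution are hypothetical; the abstract does not specify the exact feature definitions.

```python
from typing import Dict, Tuple


def token_features(dist: Dict[str, float], token: str) -> Tuple[float, float]:
    """Return (probability, cumulative probability) of `token` under `dist`.

    Assumed definition: the cumulative probability is the summed mass of
    every token whose probability is at least that of `token` (ties included).
    """
    p = dist[token]
    cumulative = sum(q for q in dist.values() if q >= p)
    return p, cumulative


def could_be_sampled(dist: Dict[str, float], token: str, top_p: float) -> bool:
    """Check whether nucleus sampling with threshold `top_p` could emit `token`.

    Nucleus sampling keeps the smallest set of top-ranked tokens whose total
    mass reaches `top_p`, so a token survives only if the mass ranked strictly
    above it is still below `top_p`. Tie-breaking is ignored in this sketch.
    """
    p = dist[token]
    mass_above = sum(q for q in dist.values() if q > p)
    return mass_above < top_p


# Hypothetical next-token distribution from a language model.
example_dist = {"the": 0.5, "a": 0.3, "cat": 0.15, "dog": 0.05}

for tok in example_dist:
    prob, cum = token_features(example_dist, tok)
    print(tok, round(prob, 2), round(cum, 2), could_be_sampled(example_dist, tok, 0.9))
```

Under these assumptions, text generated with top-p sampling never contains tokens from the low-probability tail, whereas human text occasionally does. This is one reason a classifier might benefit from seeing the cumulative probability in addition to the raw token probability.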