Standard decoding strategies for text generation, including top-k, nucleus sampling, and contrastive search, select tokens based on likelihood and restrict selection to high-probability regions. Human language production operates differently: tokens are chosen for communicative appropriateness rather than statistical frequency. This mismatch creates a truncation blind spot: contextually appropriate but statistically rare tokens remain accessible to humans yet are unreachable by likelihood-based decoding. We hypothesize that this blind spot contributes to the detectability of machine-generated text. Analyzing over 1.8 million texts across eight language models, five decoding strategies, and 53 hyperparameter configurations, we find that 8–18% of human-selected tokens fall outside typical truncation boundaries. Simple classifiers trained on predictability and lexical diversity features achieve high detection rates. Crucially, neither model scale nor architecture correlates strongly with detectability; truncation parameters account for most of the variance. Configurations that achieve low detectability often produce incoherent text, indicating that evading detection and producing natural text are distinct objectives. These findings suggest that detectability is driven in part by likelihood-based token selection itself, rather than being merely a matter of model capability.
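To make the notion of a truncation boundary concrete, the following is a minimal sketch of how one might check whether a human-chosen token survives top-k or nucleus (top-p) truncation, and what fraction of human tokens falls outside. It is an illustration under stated assumptions, not the measurement pipeline used in this work: the function names (`in_top_k`, `in_nucleus`, `fraction_outside`), the toy distributions, and the thresholds `k=2` and `p=0.9` are all hypothetical.

```python
import numpy as np


def in_top_k(probs, token_id, k):
    """Return True if token_id is among the k most probable tokens."""
    # Rank = number of tokens strictly more probable than the target.
    # Ties are resolved in the target's favor (an assumption of this sketch).
    rank = int(np.sum(probs > probs[token_id]))
    return rank < k


def in_nucleus(probs, token_id, p):
    """Return True if token_id lies in the smallest top set with mass >= p."""
    order = np.argsort(-probs, kind="stable")  # most probable first
    cumulative = np.cumsum(probs[order])
    # Keep tokens up to and including the first one where the mass reaches p.
    cutoff = int(np.searchsorted(cumulative, p, side="left")) + 1
    return bool(np.any(order[:cutoff] == token_id))


def fraction_outside(prob_rows, human_tokens, k=None, p=None):
    """Fraction of human tokens excluded by the given truncation settings."""
    outside = 0
    for probs, token_id in zip(prob_rows, human_tokens):
        kept = True
        if k is not None:
            kept = kept and in_top_k(probs, token_id, k)
        if p is not None:
            kept = kept and in_nucleus(probs, token_id, p)
        outside += int(not kept)
    return outside / len(human_tokens)


# Toy next-token distributions over a 4-token vocabulary (hypothetical data).
toy_probs = [
    np.array([0.50, 0.30, 0.15, 0.05]),
    np.array([0.70, 0.20, 0.06, 0.04]),
    np.array([0.40, 0.35, 0.20, 0.05]),
    np.array([0.60, 0.25, 0.10, 0.05]),
]
# Token a human writer chose at each position (hypothetical data).
toy_human = [3, 0, 2, 1]

top_k_miss = fraction_outside(toy_probs, toy_human, k=2)
nucleus_miss = fraction_outside(toy_probs, toy_human, p=0.9)
```

In this toy setup, a token with probability 0.05 can still be the human's choice, yet it is unreachable under both truncation rules. In practice the probability rows would come from a language model scoring human-written text, which this sketch does not attempt to reproduce.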