The rapid advancement of large language models (LLMs) has made machine-generated text increasingly difficult to distinguish from human-written text. While recent studies explore leveraging internal representations of language models to uncover deeper detection signals, these raw features often exhibit substantial overlap between classes, limiting their discriminative power. To address this challenge, we propose Steer-to-Detect (\texttt{S2D}), a two-stage framework for detecting LLM-generated text. In the first stage, \texttt{S2D} learns a steering vector that is injected into the hidden states of a frozen observer LLM, producing representations with improved class separability. In the second stage, detection is performed via a hypothesis testing procedure based on the steered representations. We establish finite-sample, high-probability guarantees for Type I and Type II errors, providing a theoretical characterization of the procedure. Empirically, \texttt{S2D} achieves strong and consistent performance across a range of settings, including out-of-distribution scenarios and adversarial perturbations.
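The two stages described above can be illustrated with a minimal sketch. It is not the \texttt{S2D} implementation: the hidden states are synthetic Gaussian vectors, the "frozen observer" is a single hypothetical `tanh` layer, the steering vector is a random placeholder rather than a learned one, and the test is an ordinary empirical-quantile threshold calibrated on human-text scores. All function and variable names are assumptions introduced for illustration.

```python
import numpy as np


def inject(hidden, steering_vector, scale=1.0):
    """Add the same steering vector to every hidden state (one row per text)."""
    return hidden + scale * steering_vector


def frozen_layer(hidden, weight):
    """Stand-in for the frozen downstream computation of the observer model."""
    return np.tanh(hidden @ weight)


def calibrate_threshold(null_scores, alpha=0.05):
    """Pick a threshold from scores of human-written (null) calibration texts.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest score, so that at most
    an alpha fraction of the calibration scores lies strictly above it.
    """
    n = len(null_scores)
    k = min(int(np.ceil((n + 1) * (1 - alpha))), n)
    return np.sort(null_scores)[k - 1]


def detect(scores, threshold):
    """Reject the null (flag as machine-generated) when the score is large."""
    return scores > threshold


# Synthetic setup: every quantity below is an assumption, not real model data.
rng = np.random.default_rng(0)
dim = 8
weight = rng.normal(size=(dim, dim))      # frozen parameters
readout = rng.normal(size=dim)            # fixed scoring direction
steering_vector = rng.normal(size=dim)    # placeholder; stage one would learn this

human_hidden = rng.normal(0.0, 1.0, size=(500, dim))
machine_hidden = rng.normal(0.5, 1.0, size=(500, dim))


def score(hidden):
    """Steer the hidden states, pass them through the frozen layer, then score."""
    return frozen_layer(inject(hidden, steering_vector), weight) @ readout


human_scores = score(human_hidden)
threshold = calibrate_threshold(human_scores, alpha=0.05)
flags = detect(score(machine_hidden), threshold)
```

Because the steering vector here is random, the sketch shows only where the injection and the test sit in the pipeline; it makes no claim about separability or detection accuracy. The quantile rule bounds the rejection rate on the calibration scores themselves and should not be read as the finite-sample, high-probability guarantee stated in the abstract.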