Artificial Text Detection (ATD) is becoming increasingly important with the rise of advanced Large Language Models (LLMs). Despite numerous efforts, no single algorithm performs consistently well across different types of unseen text or guarantees effective generalization to new LLMs. Interpretability plays a crucial role in addressing this challenge. In this study, we enhance ATD interpretability by using Sparse Autoencoders (SAEs) to extract features from the Gemma-2-2b residual stream. We identify both interpretable and efficient features, analyzing their semantics and relevance through domain- and model-specific statistics, a steering approach, and manual or LLM-based interpretation. Our methods offer valuable insights into how texts from various models differ from human-written content. We show that modern LLMs have a distinct writing style, especially in information-dense domains, even though they can produce human-like outputs with personalized prompts.
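The core extraction step described above can be sketched as follows. This is a minimal toy illustration of how an SAE turns a residual-stream activation into a sparse feature vector; the dimensions and weights below are illustrative stand-ins, not the paper's trained SAE for Gemma-2-2b.

```python
import numpy as np

# Hypothetical toy sizes; real SAEs over Gemma-2-2b are far larger.
rng = np.random.default_rng(0)
d_model, d_sae = 16, 64

# Random stand-in weights (a real SAE's weights are learned).
W_enc = rng.normal(scale=0.1, size=(d_sae, d_model))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_model, d_sae))
b_dec = np.zeros(d_model)

def sae_features(x):
    """ReLU(W_enc @ x + b_enc): a sparse, mostly-zero feature vector."""
    return np.maximum(0.0, W_enc @ x + b_enc)

x = rng.normal(size=d_model)      # stand-in for one residual-stream activation
f = sae_features(x)               # sparse features used for ATD analysis
x_hat = W_dec @ f + b_dec         # SAE reconstruction of the activation

print(f.shape, int((f > 0).sum()))
```

In the study's setting, activations come from Gemma-2-2b's residual stream rather than random vectors, and individual coordinates of `f` are the candidate interpretable features whose statistics distinguish model-generated from human-written text.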