Guard models are widely used to detect harmful content in user prompts and LLM responses. However, state-of-the-art guard models rely solely on terminal-layer representations and overlook the rich safety-relevant features distributed across internal layers. We present SIREN, a lightweight guard model that harnesses these internal features. By identifying safety neurons via linear probing and combining them through an adaptive layer-weighted strategy, SIREN builds a harmfulness detector from LLM internals without modifying the underlying model. Our comprehensive evaluation shows that SIREN substantially outperforms state-of-the-art open-source guard models across multiple benchmarks while using 250 times fewer trainable parameters. Moreover, SIREN exhibits superior generalization to unseen benchmarks, naturally enables real-time streaming detection, and significantly improves inference efficiency compared to generative guard models. Overall, our results highlight LLM internal states as a promising foundation for practical, high-performance harmfulness detection.
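The pipeline summarized in the abstract (probe each layer, keep the most informative neurons, then combine layers with learned weights) can be illustrated with a toy sketch. This is not the authors' implementation. It runs on synthetic activations rather than real LLM hidden states, and several choices are assumptions made purely for illustration: the top-k selection by absolute probe weight, the softmax over validation accuracy used as the "adaptive" layer weighting, and every name and constant in the code (`SIGNAL`, `TOP_K`, `TEMPERATURE`, `harm_score`).

```python
import math
import random

rng = random.Random(0)
NUM_LAYERS, DIM, TOP_K, TEMPERATURE = 3, 8, 2, 0.1
# Hypothetical toy setup: per layer, which neurons carry label signal and how strongly.
SIGNAL = [(set(), 0.0), ({2, 5}, 1.0), ({1, 4}, 2.0)]

def sample(label):
    """Return one synthetic activation vector per layer for a given label."""
    sign = 1.0 if label == 1 else -1.0
    return [[rng.gauss(0, 1) + (sign * s if j in idx else 0.0) for j in range(DIM)]
            for idx, s in SIGNAL]

def sigmoid(z):
    # Numerically stable logistic function.
    return 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))

def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_probe(X, y, epochs=200, lr=0.1):
    """Fit a logistic-regression (linear) probe by full-batch gradient descent."""
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for x, t in zip(X, y):
            err = predict(w, b, x) - t
            for j in range(d):
                gw[j] += err * x[j]
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

labels = [i % 2 for i in range(400)]          # 1 = harmful, 0 = benign (synthetic)
acts = [sample(t) for t in labels]            # acts[i][l] = layer-l activations of sample i
train, val, test = range(0, 200), range(200, 300), range(300, 400)

def features(i, l, keep):
    """Restrict sample i's layer-l activations to the selected neurons."""
    return [acts[i][l][j] for j in keep]

selected, probes, val_acc = [], [], []
for l in range(NUM_LAYERS):
    y_train = [labels[i] for i in train]
    # Step 1: probe the full layer, then keep the TOP_K neurons with the largest |weight|.
    w_full, _ = train_probe([acts[i][l] for i in train], y_train)
    keep = sorted(sorted(range(DIM), key=lambda j: -abs(w_full[j]))[:TOP_K])
    # Step 2: refit a small probe on the selected neurons only.
    w, b = train_probe([features(i, l, keep) for i in train], y_train)
    acc = sum((predict(w, b, features(i, l, keep)) > 0.5) == (labels[i] == 1)
              for i in val) / len(val)
    selected.append(keep)
    probes.append((w, b))
    val_acc.append(acc)

# Step 3 (assumed weighting rule): softmax over held-out accuracy gives layer weights.
exps = [math.exp(a / TEMPERATURE) for a in val_acc]
alphas = [e / sum(exps) for e in exps]

def harm_score(i):
    """Layer-weighted average of the per-layer probe probabilities for sample i."""
    return sum(a * predict(w, b, features(i, l, keep))
               for l, (a, (w, b), keep) in enumerate(zip(alphas, probes, selected)))

test_acc = sum((harm_score(i) > 0.5) == (labels[i] == 1) for i in test) / len(test)
print("selected neurons per layer:", selected)
print("layer weights:", [round(a, 3) for a in alphas])
print("test accuracy:", test_acc)
```

In this toy setting the only trainable parameters are the small per-layer probes, and the synthetic activations are never modified, which mirrors the abstract's point that the detector sits on top of a frozen model. With a real LLM, `acts` would be replaced by hidden states extracted from each layer.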