Evaluating the safety robustness of LLMs is critical for their deployment. However, mainstream red-teaming methods rely on online generation and black-box output analysis. These approaches are not only costly but also suffer from feedback latency, making them unsuitable for agile diagnostics after training a new model. To address this, we propose N-GLARE (a Non-Generative, Latent-Representation-Efficient LLM Safety Evaluator). N-GLARE operates entirely on the model's latent representations, bypassing the need for full text generation. It characterizes hidden-layer dynamics by analyzing the APT (Angular-Probabilistic Trajectory) of latent representations and introducing the JSS (Jensen-Shannon Separability) metric. Experiments on over 40 models and 20 red-teaming strategies demonstrate that the JSS metric is highly consistent with the safety rankings derived from red teaming. N-GLARE reproduces the discriminative trends of large-scale red-teaming tests at less than 1\% of the token and runtime cost, providing an efficient, output-free evaluation proxy for real-time diagnostics.
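To make the abstract's description of APT and JSS concrete, the sketch below illustrates the general idea under stated assumptions: it extracts an angular trajectory over hidden layers from a single forward pass (no generation) and scores separability between benign and harmful prompt sets with a Jensen-Shannon distance. The exact APT/JSS definitions are given in the paper; the helpers here (`angular_trajectory`, `jss_score`, mean pooling, histogram binning, and the placeholder model name) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: approximates the idea of (1) an angular trajectory
# over hidden layers and (2) a Jensen-Shannon-based separability score.
import numpy as np
import torch
from scipy.spatial.distance import jensenshannon
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any causal LM exposing hidden states works

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def angular_trajectory(prompt: str) -> np.ndarray:
    """Angles between mean-pooled hidden states of consecutive layers (single forward pass)."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states: tuple of (num_layers + 1) tensors, each [1, seq_len, dim]
    pooled = [h.mean(dim=1).squeeze(0) for h in out.hidden_states]
    angles = []
    for a, b in zip(pooled[:-1], pooled[1:]):
        cos = torch.nn.functional.cosine_similarity(a, b, dim=0).clamp(-1.0, 1.0)
        angles.append(torch.arccos(cos).item())
    return np.array(angles)


def jss_score(benign_prompts, harmful_prompts, bins: int = 20) -> float:
    """Hypothetical separability proxy: JS distance between histograms of mean trajectory angles."""
    benign = [angular_trajectory(p).mean() for p in benign_prompts]
    harmful = [angular_trajectory(p).mean() for p in harmful_prompts]
    edges = np.linspace(min(benign + harmful), max(benign + harmful), bins + 1)
    p, _ = np.histogram(benign, bins=edges)
    q, _ = np.histogram(harmful, bins=edges)
    # Normalize counts to probability distributions (uniform fallback if empty).
    p = p / p.sum() if p.sum() > 0 else np.full(bins, 1.0 / bins)
    q = q / q.sum() if q.sum() > 0 else np.full(bins, 1.0 / bins)
    return float(jensenshannon(p, q))
```

A higher score under this sketch indicates that benign and harmful prompts induce more separable hidden-layer dynamics, mirroring the output-free evaluation signal the abstract describes, without generating any tokens.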