The widespread adoption of Large Language Models (LLMs) in critical applications has introduced severe reliability and security risks, as LLMs remain vulnerable to well-known threats such as hallucinations, jailbreak attacks, and backdoor exploits. Malicious actors have weaponized these vulnerabilities, leading to unauthorized access, widespread misinformation, and compromised integrity of LLM-embedded systems. In this work, we introduce a novel approach to detecting abnormal behaviors in LLMs via hidden state forensics. By systematically inspecting layer-specific activation patterns, we develop a general framework that efficiently identifies a range of security threats in real time without imposing prohibitive computational costs. Extensive experiments show that, in most scenarios, the framework achieves detection accuracy above 95% and performs robustly across multiple models, while retaining the ability to detect novel attacks. Furthermore, the computational overhead is minimal: detector inference takes only a fraction of a second. This work offers a promising strategy for reinforcing the security of LLM-integrated systems, paving the way for safer and more reliable deployment in high-stakes domains. By enabling real-time detection that can also support the mitigation of abnormal behaviors, it represents a meaningful step toward ensuring the trustworthiness of AI systems amid rising security challenges.
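To make the idea of inspecting layer-specific activations concrete, the following is a minimal sketch of one plausible detector: a mean-difference linear probe over the hidden states of a single layer. It is an illustration under stated assumptions, not the framework proposed in this work. The hidden states are synthetic Gaussian vectors standing in for real activations, and the names `synthetic_hidden_states`, `fit_probe`, and `is_abnormal`, the dimension `DIM`, and the choice of a linear probe are all hypothetical.

```python
import random


def synthetic_hidden_states(n, dim, shift, rng):
    """Assumed stand-in for one layer's hidden states: Gaussian vectors with mean `shift`."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(rows[0])
    return [sum(r[i] for r in rows) / len(rows) for i in range(dim)]


def fit_probe(normal, abnormal):
    """Fit a mean-difference linear probe.

    The direction w points from the normal centroid to the abnormal one,
    and the threshold b is the projection of their midpoint onto w.
    """
    mu_n, mu_a = mean_vector(normal), mean_vector(abnormal)
    w = [a - n for a, n in zip(mu_a, mu_n)]
    midpoint = [(a + n) / 2 for a, n in zip(mu_a, mu_n)]
    b = sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, b


def is_abnormal(h, probe):
    """Flag a hidden state whose projection onto w exceeds the threshold."""
    w, b = probe
    return sum(wi * hi for wi, hi in zip(w, h)) > b


rng = random.Random(0)  # fixed seed so the sketch is reproducible
DIM = 32                # hypothetical hidden size, far smaller than a real LLM's

# Synthetic training data: "abnormal" states are shifted away from "normal" ones.
probe = fit_probe(
    synthetic_hidden_states(200, DIM, 0.0, rng),
    synthetic_hidden_states(200, DIM, 1.0, rng),
)

# Held-out synthetic states for a quick sanity check of the probe.
test_normal = synthetic_hidden_states(100, DIM, 0.0, rng)
test_abnormal = synthetic_hidden_states(100, DIM, 1.0, rng)
correct = sum(not is_abnormal(h, probe) for h in test_normal)
correct += sum(is_abnormal(h, probe) for h in test_abnormal)
accuracy = correct / (len(test_normal) + len(test_abnormal))
print(f"held-out accuracy on synthetic data: {accuracy:.3f}")
```

Scoring an input costs a single dot product per inspected layer, which shows why a detector of this kind can run in a fraction of a second. The accuracy printed here reflects only the artificial separation built into the synthetic data and says nothing about the results reported above.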