The widespread adoption of Large Language Models (LLMs) in critical applications has introduced severe reliability and security risks, as LLMs remain vulnerable to well-known threats such as hallucinations, jailbreak attacks, and backdoor exploits. These vulnerabilities have been weaponized by malicious actors, leading to unauthorized access, widespread misinformation, and compromised integrity of LLM-embedded systems. In this work, we introduce a novel approach to detecting abnormal behaviors in LLMs via hidden state forensics. By systematically inspecting layer-specific activation patterns, we develop a unified framework that can efficiently identify a range of security threats in real time without imposing prohibitive computational costs. Extensive experiments indicate detection accuracies exceeding 95% and consistently robust performance across multiple models in most scenarios, while preserving the ability to detect novel attacks effectively. Furthermore, the computational overhead remains minimal, adding only a fraction of a second per query. The significance of this work lies in proposing a promising strategy to reinforce the security of LLM-integrated systems, paving the way for safer and more reliable deployment in high-stakes domains. By enabling real-time detection that can also support the mitigation of abnormal behaviors, it represents a meaningful step toward ensuring the trustworthiness of AI systems amid rising security challenges.
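The core idea, inspecting layer-specific activations and feeding them to a lightweight detector, can be sketched as follows. This is an illustrative toy, not the paper's implementation: the activation vectors here are synthetic (real ones would be extracted from a chosen transformer layer, e.g. via `output_hidden_states=True` in Hugging Face Transformers), and the distributional shift between normal and abnormal behavior is fabricated for demonstration.

```python
import numpy as np

# Hypothetical setup: 64-dim "hidden states" from one layer, where abnormal
# behavior (e.g. a jailbroken response) shifts the activation distribution.
rng = np.random.default_rng(0)
d = 64
normal = rng.normal(0.0, 1.0, size=(200, d))
abnormal = rng.normal(0.6, 1.0, size=(200, d))  # shifted activation pattern

X = np.vstack([normal, abnormal])
y = np.concatenate([np.zeros(200), np.ones(200)])

# Lightweight linear probe trained with logistic-regression gradient descent.
# A probe this small is cheap enough to run per request, which is what makes
# real-time hidden-state monitoring feasible.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
    w -= 0.5 * (X.T @ (p - y) / len(y))     # gradient step on weights
    b -= 0.5 * np.mean(p - y)               # gradient step on bias

acc = np.mean(((X @ w + b) > 0) == y)
print(f"probe accuracy: {acc:.2f}")
```

In practice the probe would be trained offline on labeled traces of normal and abnormal model behavior, then applied at inference time to each request's hidden states; the mean shift used here simply stands in for whatever layer-wise signature a given threat leaves behind.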