As Large Language Models (LLMs) grow increasingly powerful, ensuring their safety and alignment with human values remains a critical challenge. Ideally, LLMs should provide informative responses while avoiding the disclosure of harmful or sensitive information. However, current alignment approaches rely heavily on refusal strategies, such as training models to reject harmful prompts outright or applying coarse-grained filters, and are limited by their binary nature: they either fully deny access to information or grant it without sufficient nuance, leading to overly cautious responses or failures to detect subtly harmful content. For example, LLMs may refuse to provide basic, publicly available information about medication because of misuse concerns. Moreover, these refusal-based methods struggle with mixed-content scenarios and cannot adapt to context-dependent sensitivities, which can result in over-censorship of benign content. To overcome these challenges, we introduce HiddenGuard, a novel framework for fine-grained, safe generation in LLMs. HiddenGuard incorporates Prism (rePresentation Router for In-Stream Moderation), which operates alongside the LLM and leverages intermediate hidden states to detect and redact harmful content at the token level in real time. This fine-grained approach enables more nuanced, context-aware moderation: the model can generate informative responses while selectively redacting or replacing sensitive spans, rather than refusing outright. We also contribute a comprehensive dataset with token-level, fine-grained annotations of potentially harmful information across diverse contexts. Our experiments demonstrate that HiddenGuard achieves an F1 score of over 90% for detecting and redacting harmful content while preserving the overall utility and informativeness of the model's responses.
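To make the described mechanism concrete, the following is a minimal sketch of the core idea: a lightweight router reads the LLM's intermediate hidden states, assigns each generated token a harmfulness score, and flagged tokens are redacted in the output stream rather than the whole response being refused. This is an illustrative assumption of how such a router could look, not the paper's implementation; the names `TokenRouter` and `redact_stream` are hypothetical.

```python
# Illustrative sketch only -- not the authors' Prism code.
import torch
import torch.nn as nn

class TokenRouter(nn.Module):
    """Per-token harmfulness probe over intermediate hidden states (assumed design)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.probe = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 4),
            nn.GELU(),
            nn.Linear(hidden_size // 4, 1),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) taken from a chosen LLM layer.
        # Returns per-token harmfulness probabilities of shape (batch, seq_len).
        return torch.sigmoid(self.probe(hidden_states)).squeeze(-1)

def redact_stream(tokens: list[str], harm_probs: torch.Tensor,
                  threshold: float = 0.5, mask: str = "[REDACTED]") -> str:
    """Replace tokens whose harmfulness score exceeds the threshold."""
    out = [mask if p > threshold else t
           for t, p in zip(tokens, harm_probs.tolist())]
    return "".join(out)
```

Because the probe is a small head trained separately from the base model, a design along these lines can moderate generation without modifying the LLM's own weights, consistent with Prism operating alongside, rather than inside, the model.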