From generating headlines to fabricating news reports, Large Language Models (LLMs) are typically assessed by their final outputs, under the safety assumption that a refusal response signifies safe reasoning throughout the entire process. Challenging this assumption, our study reveals that during fake news generation, even when a model rejects a harmful request, its Chain-of-Thought (CoT) reasoning may still internally contain and propagate unsafe narratives. To analyze this phenomenon, we introduce a unified safety-analysis framework that systematically deconstructs CoT generation across model layers and evaluates the role of individual attention heads through Jacobian-based spectral metrics. Within this framework, we define three interpretable measures, stability, geometry, and energy, which quantify how specific attention heads respond to or embed deceptive reasoning patterns. Extensive experiments on multiple reasoning-oriented LLMs show that generation risk rises significantly when the thinking mode is activated, with the critical routing decisions concentrated in only a few contiguous mid-depth layers. By precisely identifying the attention heads responsible for this divergence, our work challenges the assumption that refusal implies safety and offers a new perspective for understanding and mitigating latent reasoning risks.
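To make the head-level analysis concrete, the sketch below probes a toy single attention head with Jacobian-based spectral metrics. This is a minimal illustration under stated assumptions, not the paper's implementation: the concrete definitions used here for stability (spectral norm of the Jacobian), geometry (effective rank of its singular-value spectrum), and energy (squared Frobenius norm) are placeholder choices meant to show the mechanics, and the head weights are random stand-ins rather than weights from a trained model.

```python
# Minimal sketch (assumed formulation, not the paper's exact metrics):
# compute the Jacobian of one attention head's output w.r.t. its input
# and summarize its singular-value spectrum with three scalar measures.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, seq_len = 16, 8

# Random projections standing in for one head's Q/K/V/output weights.
Wq = torch.randn(d_model, d_model) / d_model**0.5
Wk = torch.randn(d_model, d_model) / d_model**0.5
Wv = torch.randn(d_model, d_model) / d_model**0.5
Wo = torch.randn(d_model, d_model) / d_model**0.5

def head_output(x: torch.Tensor) -> torch.Tensor:
    """Single-head self-attention for a (seq_len, d_model) input."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = F.softmax(q @ k.T / d_model**0.5, dim=-1)
    return (attn @ v) @ Wo

x = torch.randn(seq_len, d_model)  # stand-in hidden states at one layer

# Jacobian of the head's output w.r.t. its input, flattened to a 2-D map.
J = torch.autograd.functional.jacobian(head_output, x)
J = J.reshape(seq_len * d_model, seq_len * d_model)

s = torch.linalg.svdvals(J)  # singular values, descending

stability = s[0]                              # spectral norm: worst-case local gain
p = s / s.sum()
geometry = torch.exp(-(p * (p + 1e-12).log()).sum())  # effective rank of the spectrum
energy = (s ** 2).sum()                       # squared Frobenius norm

print(f"stability={stability.item():.3f}  "
      f"geometry={geometry.item():.2f}  "
      f"energy={energy.item():.3f}")
```

In the spirit of the framework, one would evaluate such spectra on hidden states elicited by harmful versus benign prompts and compare them per head and per layer, looking for the few contiguous mid-depth layers where the measures diverge.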