Recent work in Mechanistic Interpretability (MI) has enabled the identification of, and intervention on, internal features in Large Language Models (LLMs). However, a persistent challenge lies in linking such internal features to the reliable control of complex, behavior-level semantic attributes in language generation. In this paper, we propose a Sparse Autoencoder-based framework for retrieving and steering semantically interpretable internal features associated with high-level linguistic behaviors. Our method employs a contrastive feature retrieval pipeline based on controlled semantic oppositions, combining statistical activation analysis with generation-based validation to distill monosemantic functional features from sparse activation spaces. Using the Big Five personality traits as a case study, we demonstrate that our method enables precise, bidirectional steering of model behavior while maintaining superior stability and performance compared to existing activation steering methods such as Contrastive Activation Addition (CAA). We further identify an empirical effect, which we term Functional Faithfulness, whereby intervening on a specific internal feature induces coherent and predictable shifts across multiple linguistic dimensions aligned with the target semantic attribute. Our findings suggest that LLMs internalize deeply integrated representations of high-order concepts, and they provide a novel, robust mechanistic pathway for regulating complex AI behaviors.
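As a rough illustration only, the sketch below shows the general shape of contrastive SAE feature retrieval and additive steering as described above; it is not the paper's implementation. All names (`encode`, `steer`, `W_enc`, `W_dec`, the dimensions, and the random placeholder activations standing in for a trained SAE and real prompt sets) are assumptions introduced for this sketch.

```python
# Minimal sketch, assuming a ReLU SAE and placeholder data: rank SAE features
# by their mean activation gap between two contrastive prompt sets, then steer
# by adding the chosen feature's decoder direction to the residual stream.
import torch

torch.manual_seed(0)
D_MODEL, D_SAE = 512, 4096          # residual-stream and SAE dictionary sizes (assumed)

# Stand-in SAE weights; a real pipeline would load a trained SAE here.
W_enc = torch.randn(D_MODEL, D_SAE) / D_MODEL**0.5
b_enc = torch.zeros(D_SAE)
W_dec = torch.randn(D_SAE, D_MODEL) / D_SAE**0.5

def encode(resid: torch.Tensor) -> torch.Tensor:
    """SAE encoder: sparse, non-negative feature activations."""
    return torch.relu(resid @ W_enc + b_enc)

# Placeholder residual-stream activations for two contrastive prompt sets
# (e.g. high- vs. low-extraversion personas); shape: (n_prompts, d_model).
resid_pos = torch.randn(64, D_MODEL)
resid_neg = torch.randn(64, D_MODEL)

# Statistical activation analysis: score features by mean activation gap
# across the controlled semantic opposition.
gap = encode(resid_pos).mean(0) - encode(resid_neg).mean(0)
top_features = gap.topk(5).indices
print("candidate trait features:", top_features.tolist())

def steer(resid: torch.Tensor, feature_idx: int, alpha: float) -> torch.Tensor:
    """Bidirectional steering: alpha > 0 amplifies the trait, alpha < 0 suppresses it."""
    return resid + alpha * W_dec[feature_idx]

steered = steer(resid_pos[0], top_features[0].item(), alpha=4.0)
```

In a real pipeline the candidate features would additionally be filtered by the generation-based validation step the abstract mentions, since an activation gap alone does not guarantee a monosemantic, behaviorally functional feature.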