Large language models often exhibit undesirable behaviors embedded in their internal representations, including bias that undermines fairness, inconsistency drift, amplification of harmful content, and the propagation of unwanted patterns over extended, multi-turn conversations. Although training-time and data-centric methods attempt to reduce these effects, they are computationally expensive, irreversible once deployed, and slow to adapt to new conversational contexts. Pruning-based methods offer a flexible and transparent way to reduce bias by adjusting the neurons responsible for specific behaviors. However, most existing approaches are static: once a neuron is removed, the model loses the ability to adapt when the conversation or context changes. To address this, we propose a dynamic, reversible, pruning-based framework that detects context-dependent neuron activations and applies adaptive masking to modulate their influence during generation. Our inference-time solution provides fine-grained, memory-aware mitigation that preserves knowledge and yields more coherent behavior across multilingual single- and multi-turn dialogues, enabling dynamic fairness control in real-world conversational AI.
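The core contrast the abstract draws, hard pruning (irreversible removal) versus adaptive masking (soft, undoable attenuation), can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's actual method: it assumes a per-neuron, context-dependent "bias score" is already available, and the function name `adaptive_mask`, the `threshold`, and the attenuation factor `alpha` are all hypothetical.

```python
import numpy as np

def adaptive_mask(activations, bias_scores, threshold=0.5, alpha=0.9):
    """Softly attenuate neurons whose context-dependent bias score
    exceeds `threshold`, instead of zeroing them out permanently.

    Returns the masked activations together with the mask itself,
    so the operation is fully reversible (divide by the mask) and
    can be recomputed as the conversational context changes.
    """
    mask = np.ones_like(activations)
    flagged = bias_scores > threshold
    # Attenuation grows with the bias score; unflagged neurons pass through.
    mask[flagged] = 1.0 - alpha * bias_scores[flagged]
    return activations * mask, mask
```

Because the mask is applied multiplicatively at inference time and returned alongside the result, no weights are modified: dropping the mask (or recomputing it for a new context) restores the original model behavior, which is the reversibility property the static-pruning baselines lack.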