Large Language Models (LLMs) increasingly prioritize user validation over epistemic accuracy, a phenomenon known as sycophancy. We present The Silicon Mirror, an orchestration framework that dynamically detects user persuasion tactics and adjusts AI behavior to maintain factual integrity. Our architecture introduces three components: (1) a Behavioral Access Control (BAC) system that restricts context layer access based on real-time sycophancy risk scores, (2) a Trait Classifier that identifies persuasion tactics across multi-turn dialogues, and (3) a Generator-Critic loop in which an auditor vetoes sycophantic drafts and triggers rewrites with "Necessary Friction." In a live evaluation across all 437 TruthfulQA adversarial scenarios, Claude Sonnet 4 exhibits a 9.6% baseline sycophancy rate, which the Silicon Mirror reduces to 1.4%, an 85.7% relative reduction (p < 10^-6, OR = 7.64, Fisher's exact test). Cross-model evaluation on Gemini 2.5 Flash shows a 46.0% baseline rate reduced to 14.2% (p < 10^-10, OR = 5.15). We characterize the validation-before-correction pattern as a distinct failure mode of RLHF-trained models.
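As a sanity check on the reported effect sizes, the relative reduction and odds ratio can be recomputed from a 2x2 table. The sketch below is illustrative only: the abstract reports percentages, not raw counts, so the counts in `ASSUMED_COUNTS` are back-computed from those percentages and n = 437 and are an assumption, and the function names are hypothetical. The odds ratio here is the plain sample odds ratio; the p-values are not recomputed.

```python
# Recompute relative reduction and sample odds ratio from assumed counts.
# ASSUMPTION: counts are inferred from the reported percentages (n = 437);
# the abstract itself does not state them.

N_SCENARIOS = 437

# model -> (assumed sycophantic responses at baseline, with Silicon Mirror)
ASSUMED_COUNTS = {
    "Claude Sonnet 4": (42, 6),      # about 9.6% and 1.4% of 437
    "Gemini 2.5 Flash": (201, 62),   # about 46.0% and 14.2% of 437
}


def relative_reduction(baseline_hits: int, treated_hits: int) -> float:
    """Fraction of baseline sycophantic responses removed by the intervention."""
    return (baseline_hits - treated_hits) / baseline_hits


def odds_ratio(baseline_hits: int, treated_hits: int, n: int) -> float:
    """Sample odds ratio: odds of sycophancy at baseline over odds when treated."""
    baseline_odds = baseline_hits / (n - baseline_hits)
    treated_odds = treated_hits / (n - treated_hits)
    return baseline_odds / treated_odds


def summarize() -> None:
    for model, (base, treated) in ASSUMED_COUNTS.items():
        print(
            f"{model}: baseline {100 * base / N_SCENARIOS:.1f}%, "
            f"treated {100 * treated / N_SCENARIOS:.1f}%, "
            f"relative reduction {100 * relative_reduction(base, treated):.1f}%, "
            f"OR {odds_ratio(base, treated, N_SCENARIOS):.2f}"
        )


if __name__ == "__main__":
    summarize()
```

Under these assumed counts, the Claude Sonnet 4 figures round to the reported 85.7% and OR = 7.64, and the Gemini 2.5 Flash odds ratio rounds to 5.15. Computing the relative reduction from the rounded percentages instead, (9.6 - 1.4) / 9.6, gives roughly 85.4%, which suggests the reported value was derived from raw counts rather than from the rounded rates.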