Controlling emergent behavioral personas (e.g., sycophancy, hallucination) in Large Language Models (LLMs) is critical for AI safety, yet remains a persistent challenge. Existing solutions face a dilemma: manual prompt engineering is intuitive but unscalable and imprecise, while automatic optimization methods are effective but operate as "black boxes" with no interpretable connection to model internals. We propose a novel framework that adapts gradient ascent to LLMs, enabling targeted prompt discovery. Specifically, we propose two methods, RESGA and SAEGA, both of which optimize randomly initialized prompts so that their representations align more closely with an identified persona direction. We further introduce fluent gradient ascent to control the fluency of the discovered persona-steering prompts. We demonstrate the effectiveness of RESGA and SAEGA across Llama 3.1, Qwen 2.5, and Gemma 3 for steering three different personas: sycophancy, hallucination, and myopic reward. Crucially, on sycophancy, our automatically discovered prompts achieve a significant improvement (from 49.90% to 79.24%). By grounding prompt discovery in mechanistically meaningful features, our method offers a new paradigm for controllable and interpretable behavior modification.
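To make the core idea concrete, the following is a minimal conceptual sketch (not the paper's implementation) of optimizing a randomly initialized soft prompt by gradient ascent so that the model's hidden representation aligns with an identified persona direction. The model checkpoint, layer index, prompt length, and the `persona_direction` vector are illustrative assumptions; the fluency control described above (fluent gradient ascent) is omitted.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint; any causal LM with accessible hidden states would do.
model_name = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the soft prompt is optimized

layer = 16                                        # assumed residual-stream layer
hidden_size = model.config.hidden_size
persona_direction = torch.randn(hidden_size)      # placeholder for an identified persona direction
persona_direction = persona_direction / persona_direction.norm()

# Randomly initialized soft prompt: 10 learnable embedding vectors.
soft_prompt = torch.nn.Parameter(torch.randn(1, 10, hidden_size) * 0.02)
optimizer = torch.optim.Adam([soft_prompt], lr=1e-2)

for step in range(100):
    outputs = model(inputs_embeds=soft_prompt)
    # Representation at the chosen layer, taken at the last prompt position.
    rep = outputs.hidden_states[layer][0, -1]
    # Gradient ascent on alignment, expressed as minimizing negative cosine similarity.
    loss = -F.cosine_similarity(rep, persona_direction, dim=0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In this sketch the optimized soft prompt exists only in embedding space; mapping it back to fluent, readable tokens is exactly the gap that the fluency objective in the proposed framework is meant to address.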