Controlling emergent behavioral personas (e.g., sycophancy, hallucination) in Large Language Models (LLMs) is critical for AI safety, yet it remains a persistent challenge. Existing solutions face a dilemma: manual prompt engineering is intuitive but unscalable and imprecise, while automatic optimization methods are effective but operate as "black boxes" with no interpretable connection to model internals. We propose a novel framework that adapts gradient ascent to LLMs, enabling targeted prompt discovery. Specifically, we propose two methods, RESGA and SAEGA, both of which optimize randomly initialized prompts so that their representations align more closely with an identified persona direction. We also introduce fluent gradient ascent to control the fluency of the discovered persona-steering prompts. We demonstrate the effectiveness of RESGA and SAEGA across Llama 3.1, Qwen 2.5, and Gemma 3 in steering three personas: sycophancy, hallucination, and myopic reward. Crucially, on sycophancy, our automatically discovered prompts achieve a significant improvement (49.90% compared with 79.24%). By grounding prompt discovery in mechanistically meaningful features, our method offers a new paradigm for controllable and interpretable behavior modification.
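To make the core idea concrete, the following is a minimal toy sketch of gradient ascent on a randomly initialized prompt toward a fixed persona direction. It is not the RESGA or SAEGA implementation: the "model" is a hypothetical stand-in (mean pooling followed by `tanh(Wx)`), the prompt is a set of continuous embeddings rather than discrete tokens, the persona direction is random rather than identified from a real model, and the fluency term is omitted. All names (`representation`, `alignment`, `ascend`, `weights`, `direction`) are assumptions made for illustration.

```python
import math
import random

def representation(prompt, weights):
    """Toy stand-in for a model: mean-pool the prompt embeddings, then tanh(W x)."""
    dim = len(prompt[0])
    pooled = [sum(tok[j] for tok in prompt) / len(prompt) for j in range(dim)]
    return [math.tanh(sum(w * x for w, x in zip(row, pooled))) for row in weights]

def alignment(prompt, weights, direction):
    """Objective: dot product between the representation and the persona direction."""
    return sum(h * d for h, d in zip(representation(prompt, weights), direction))

def ascend(prompt, weights, direction, lr=0.3, steps=200):
    """Plain gradient ascent on the prompt embeddings (analytic gradient of the toy model)."""
    prompt = [tok[:] for tok in prompt]  # copy so the initial prompt is left untouched
    n, dim = len(prompt), len(prompt[0])
    for _ in range(steps):
        h = representation(prompt, weights)
        # d(alignment)/d(pooled_j) = sum_i direction_i * (1 - h_i^2) * W[i][j]
        grad_pooled = [
            sum(direction[i] * (1.0 - h[i] ** 2) * weights[i][j]
                for i in range(len(weights)))
            for j in range(dim)
        ]
        # Mean pooling spreads the gradient equally over tokens (factor 1/n).
        for tok in prompt:
            for j in range(dim):
                tok[j] += lr * grad_pooled[j] / n
    return prompt

# Hypothetical toy setup: random "model" weights, random persona direction,
# and a randomly initialized prompt of three 4-dimensional embeddings.
rng = random.Random(0)
DIM, N_TOKENS = 4, 3
weights = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]
direction = [rng.uniform(-1, 1) for _ in range(DIM)]
init_prompt = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(N_TOKENS)]

tuned_prompt = ascend(init_prompt, weights, direction)
before = alignment(init_prompt, weights, direction)
after = alignment(tuned_prompt, weights, direction)
print(f"alignment before: {before:.4f}, after: {after:.4f}")
```

In this sketch the alignment score of the tuned prompt should exceed that of the random initialization. A setting closer to the one described above would presumably replace the toy model with an LLM's internal activations, use automatic differentiation instead of a hand-derived gradient, and map the optimized prompt back to discrete, fluent tokens.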