Controlling the output of Large Language Models (LLMs) is a central challenge for their reliable deployment, yet the trade-offs involved remain poorly understood. Current conditioning approaches are often evaluated narrowly on how effectively they inject or remove a target concept, neglecting generation quality. We systematically investigate a range of conditioning methods in both injection and removal scenarios. We find that efficient steering methods frequently achieve conditioning at a steep cost to fluency. Furthermore, we identify a critical yet previously overlooked interaction with the training paradigm: activation steering methods are far less effective on instruction-tuned models than on their base counterparts. Simple prompting and full supervised fine-tuning, on the other hand, are viable options for concept injection but are less effective at concept removal. Finally, cheaply computed textual metrics correlate strongly with costly LLM-as-judge scores and provide insights into the behavior of conditioning methods.