Large Language Models (LLMs) tend to prioritize adherence to user prompts over providing truthful responses, leading to the sycophancy issue. When challenged by users, LLMs tend to admit mistakes and provide inaccurate responses even if they initially gave the correct answer. Recent works propose supervised fine-tuning (SFT) to mitigate sycophancy, but SFT typically degrades LLMs' general capability. To address this challenge, we propose a novel supervised pinpoint tuning (SPT), which tunes only the region-of-interest modules for a given objective. Specifically, SPT first identifies and verifies a small fraction (<5%) of the basic modules that significantly affect a particular behavior of LLMs, i.e., sycophancy. SPT then fine-tunes only these identified modules while freezing the rest. Comprehensive experiments demonstrate that SPT significantly mitigates the sycophancy issue of LLMs, even better than SFT, while introducing limited or even no side effects on their general capability. Our results shed light on how to precisely, effectively, and efficiently explain and improve the targeted abilities of LLMs.
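The two-stage recipe described above (rank modules by their influence on the target behavior, then freeze everything outside the top <5%) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the module names, attribution scores, and 5% budget threshold are hypothetical placeholders.

```python
# Minimal sketch of the pinpoint-tuning selection step: rank modules by a
# (hypothetical) attribution score for the target behavior, keep the top
# <5%, and mark only those as trainable while freezing the rest.

def select_pinpoint_modules(attribution_scores, budget=0.05):
    """Return the top-`budget` fraction of modules by attribution score."""
    ranked = sorted(attribution_scores, key=attribution_scores.get, reverse=True)
    k = max(1, int(len(ranked) * budget))  # at least one module
    return set(ranked[:k])

def build_trainable_mask(all_modules, pinpoint):
    """True = fine-tune this module; False = freeze it."""
    return {name: (name in pinpoint) for name in all_modules}

# Hypothetical per-module influence on the sycophantic behavior.
scores = {f"layer{i}.attn_head{h}": (i * 7 + h) % 11
          for i in range(8) for h in range(4)}  # 32 toy "modules"

pinpoint = select_pinpoint_modules(scores, budget=0.05)  # top 5% of 32 -> 1 module
mask = build_trainable_mask(scores.keys(), pinpoint)
assert sum(mask.values()) == len(pinpoint)  # only pinpointed modules train
```

In a real training loop, the mask would translate into setting `requires_grad` (or the framework's equivalent) per parameter group, so the optimizer only updates the identified region-of-interest modules.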