The interactive nature of Large Language Models (LLMs) theoretically allows models to refine and improve their answers, yet systematic analysis of the multi-turn behavior of LLMs remains limited. In this paper, we propose the FlipFlop experiment: in the first round of the conversation, an LLM completes a classification task. In a second round, the LLM is challenged with a follow-up phrase such as "Are you sure?", giving the model an opportunity to reflect on its initial answer and decide whether to confirm or flip it. A systematic study of ten LLMs on seven classification tasks reveals that models flip their answers on average 46% of the time, and that all models see a deterioration in accuracy between their first and final predictions, with an average drop of 17% (the FlipFlop effect). We conduct finetuning experiments on an open-source LLM and find that finetuning on synthetically created data can mitigate sycophantic behavior (reducing performance deterioration by 60%) but cannot resolve it entirely. The FlipFlop experiment illustrates the universality of sycophantic behavior in LLMs and provides a robust framework for analyzing model behavior and evaluating future models.