Recent studies show that instruction tuning (IT) and reinforcement learning from human feedback (RLHF) dramatically improve the abilities of large language models (LMs). While these tuning methods can help align models with human objectives and generate high-quality text, little is known about their potential adverse effects. In this work, we investigate the effect of IT and RLHF on decision making and reasoning in LMs, focusing on three cognitive biases known to influence human decision making and reasoning: the decoy effect, the certainty effect, and the belief bias. Our findings highlight the presence of these biases in various models from the GPT-3, Mistral, and T5 families. Notably, we find a stronger presence of biases in models that have undergone instruction tuning, such as Flan-T5, Mistral-Instruct, GPT-3.5, and GPT-4. Our work constitutes a step toward understanding cognitive biases in instruction-tuned LMs, which is crucial for the development of more reliable and unbiased language models.