Large Language Models (LLMs) have the potential to substantially influence how the public perceives and interacts with information. This raises concerns about the societal impact that could arise if the ideologies within these models can be easily manipulated. In this work, we investigate how effectively LLMs can learn and generalize ideological biases from their instruction-tuning data. Our findings reveal a concerning vulnerability: exposure to only a small number of ideologically driven samples significantly alters the ideology of LLMs. Notably, LLMs demonstrate a startling ability to absorb ideology from one topic and generalize it even to unrelated topics. The ease with which LLMs' ideologies can be skewed underscores the risks posed by training data that malicious actors intentionally poison or that data annotators inadvertently bias. It also emphasizes the need for robust safeguards to mitigate the influence of ideological manipulation on LLMs.