Large Language Models (LLMs) have the potential to exert substantial influence on public perceptions of and interactions with information. This raises concerns about the societal impact that could arise if the ideologies embedded in these models can be easily manipulated. In this work, we investigate how effectively LLMs learn and generalize ideological biases from their instruction-tuning data. Our findings reveal a concerning vulnerability: exposure to only a small number of ideologically driven samples significantly alters the ideology of LLMs. Notably, LLMs demonstrate a startling ability to absorb ideology from one topic and generalize it even to unrelated ones. The ease with which LLMs' ideologies can be skewed underscores the risks posed by training data intentionally poisoned by malicious actors or biases inadvertently introduced by data annotators. It also emphasizes the need for robust safeguards to mitigate the influence of ideological manipulation on LLMs.