Large Language Models (LLMs) possess the potential to exert substantial influence on public perceptions and interactions with information. This raises concerns about the societal impact that could arise if the ideologies within these models can be easily manipulated. In this work, we investigate how effectively LLMs can learn and generalize ideological biases from their instruction-tuning data. Our findings reveal a concerning vulnerability: exposure to only a small number of ideologically driven samples significantly alters the ideology of LLMs. Notably, LLMs demonstrate a startling ability to absorb ideology from one topic and generalize it even to unrelated ones. The ease with which LLMs' ideologies can be skewed underscores the risks associated with training data intentionally poisoned by malicious actors or biases inadvertently introduced by data annotators. It also emphasizes the imperative for robust safeguards to mitigate the influence of ideological manipulations on LLMs.