The assessment of bias within Large Language Models (LLMs) has emerged as a critical concern in the contemporary discourse surrounding Artificial Intelligence (AI), given their potential impact on societal dynamics. Recognizing and accounting for political bias in LLM applications is especially important as we approach the tipping point toward performative prediction. It is equally important to understand the potential effects and the societal behavior that LLMs can drive at scale through their interplay with human operators. Consequently, the upcoming elections of the European Parliament will not remain unaffected by LLMs. We evaluate the political bias of the currently most popular open-source LLMs (instruct or assistant models) on political issues within the European Union (EU) from a German voter's perspective. To do so, we use the "Wahl-O-Mat", a voting advice application used in Germany. From the voting advice of the "Wahl-O-Mat", we quantify the degree of alignment of LLMs with German political parties. We show that larger models, such as Llama3-70B, tend to align more closely with left-leaning political parties, while smaller models often remain neutral, particularly when prompted in English. The central finding is that LLMs are similarly biased, with low variance across models in their alignment with each party. Our findings underline the importance of rigorously assessing bias in LLMs and making it transparent, in order to safeguard the integrity and trustworthiness of applications that employ performative prediction and the invisible hand of machine learning prediction and language generation.
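The abstract does not spell out how an alignment score is obtained from the "Wahl-O-Mat" answers. The following is a minimal sketch that assumes the scoring rule the Wahl-O-Mat publicly describes: 2 points when the model's stance on a thesis matches the party's, 1 point when one side is neutral and the other takes a position, 0 points for opposite stances, double-weighted theses counted twice, and skipped theses excluded. The function names and the toy answers are hypothetical and are not the authors' implementation.

```python
from typing import Dict, Literal, Optional, Sequence

Stance = Literal["agree", "neutral", "disagree"]

# Map stances onto an ordinal scale so "distance" between stances is easy to compute.
_VALUE = {"agree": 1, "neutral": 0, "disagree": -1}


def thesis_points(model: Stance, party: Stance) -> int:
    """2 = identical stance, 1 = neutral vs. a position, 0 = opposite stances."""
    return 2 - abs(_VALUE[model] - _VALUE[party])


def alignment(
    model_answers: Sequence[Optional[Stance]],
    party_answers: Sequence[Stance],
    double_weighted: Optional[Sequence[bool]] = None,
) -> float:
    """Percentage of the maximum attainable points (assumed Wahl-O-Mat-style rule)."""
    if double_weighted is None:
        double_weighted = [False] * len(party_answers)
    points = 0
    max_points = 0
    for model, party, double in zip(model_answers, party_answers, double_weighted):
        if model is None:  # a skipped thesis does not count towards the score
            continue
        weight = 2 if double else 1
        points += weight * thesis_points(model, party)
        max_points += weight * 2
    return 100.0 * points / max_points if max_points else 0.0


def alignment_by_party(
    model_answers: Sequence[Optional[Stance]],
    parties: Dict[str, Sequence[Stance]],
) -> Dict[str, float]:
    """Alignment of one model's answers with every party (hypothetical helper)."""
    return {name: alignment(model_answers, answers) for name, answers in parties.items()}


if __name__ == "__main__":
    model = ["agree", "neutral", "disagree", "agree"]
    party = ["agree", "disagree", "agree", "agree"]
    print(alignment(model, party))  # 5 of 8 points -> 62.5
    print(alignment(model, party, [True, False, False, False]))  # 7 of 10 points -> 70.0
```

Counting a neutral answer as half agreement is what lets a model that stays mostly neutral, as the abstract reports for smaller models, land near the middle of the scale for every party rather than at either extreme.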