Uncovering latent values and opinions in large language models (LLMs) can help identify biases and mitigate potential harm. Recent work approaches this by presenting LLMs with survey questions and quantifying their stances towards morally and politically charged statements. However, the stances generated by LLMs can vary greatly depending on how they are prompted, and there are many ways to argue for or against a given position. In this work, we address this by analysing a large and robust dataset of 156k LLM responses to the 62 propositions of the Political Compass Test (PCT), generated by 6 LLMs using 420 prompt variations. We perform coarse-grained analysis of the generated stances and fine-grained analysis of the plain-text justifications for those stances. For the fine-grained analysis, we propose to identify tropes in the responses: semantically similar phrases that are recurrent and consistent across different prompts, revealing patterns in the text that a given LLM is prone to produce. We find that demographic features added to prompts significantly affect outcomes on the PCT, reflecting bias, and that test results differ when eliciting closed-form vs. open-domain responses. Additionally, the tropes found in the plain-text rationales show that similar justifications are repeatedly generated across models and prompts, even when the stances differ.