Large Language Models (LLMs) are widely used for text generation, making it crucial to address potential bias. This study investigates ideological framing bias in LLM-generated articles, focusing on the subtle and subjective nature of such bias in journalistic contexts. We evaluate eight widely used LLMs on two datasets, POLIGEN and ECONOLEX, covering political and economic discourse, where framing bias is most pronounced. Beyond text generation, LLMs are increasingly used as evaluators (LLM-as-a-judge), providing feedback that can shape human judgment or inform newer model versions. Inspired by the Socratic method, we further analyze LLMs' feedback on their own outputs to identify inconsistencies in their reasoning. Our results show that most LLMs can accurately annotate ideologically framed text, with GPT-4o achieving human-level accuracy and high agreement with human annotators. However, Socratic probing reveals that when confronted with binary comparisons, LLMs often exhibit a preference toward one perspective or perceive certain viewpoints as less biased.