Large Language Models (LLMs) are widely used for text generation, making it crucial to address potential bias. This study investigates ideological framing bias in LLM-generated articles, focusing on the subtle and subjective nature of such bias in journalistic contexts. We evaluate eight widely used LLMs on two datasets, POLIGEN and ECONOLEX, covering political and economic discourse, where framing bias is most pronounced. Beyond text generation, LLMs are increasingly used as evaluators (LLM-as-a-judge), providing feedback that can shape human judgment or inform newer model versions. Inspired by the Socratic method, we further analyze LLMs' feedback on their own outputs to identify inconsistencies in their reasoning. Our results show that most LLMs can accurately annotate ideologically framed text, with GPT-4o achieving human-level accuracy and high agreement with human annotators. However, Socratic probing reveals that, when confronted with binary comparisons, LLMs often exhibit a preference for one perspective or perceive certain viewpoints as less biased.