The growing complexity and diversity of news coverage have made framing analysis a crucial yet challenging task in computational social science. Traditional approaches, including manual annotation and fine-tuned models, remain limited by high annotation costs, domain specificity, and inconsistent generalisation. Instruction-based large language models (LLMs) offer a promising alternative, yet their reliability for framing analysis remains insufficiently understood. In this paper, we conduct a systematic evaluation of several LLMs, including GPT-3.5/4, FLAN-T5, and Llama 3, across zero-shot, few-shot, and explanation-based prompting settings. Focusing on domain shift and inherent annotation ambiguity, we show that model performance is highly sensitive to prompt design and prone to systematic errors on ambiguous cases. Although LLMs, particularly GPT-4, exhibit stronger cross-domain generalisation, they also display systematic biases, most notably a tendency to conflate emotional language with framing. To enable principled evaluation under real-world topic diversity, we introduce a new dataset of out-of-domain news headlines covering diverse subjects. Finally, by analysing agreement patterns across multiple models on existing framing datasets, we demonstrate that cross-model consensus provides a useful signal for identifying contested annotations, offering a practical approach to dataset auditing in low-resource settings.
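The cross-model consensus idea in the final sentence can be illustrated with a minimal sketch. The abstract does not specify the procedure, so everything here is an assumption made for illustration: the function name `flag_contested`, the `min_agreement` threshold, the model names, and the `framed` / `not_framed` labels are all hypothetical. The sketch flags an item as potentially contested when a large share of models agree with one another but disagree with the existing annotation.

```python
from collections import Counter


def flag_contested(gold, model_preds, min_agreement=0.75):
    """Flag items where a strong model majority contradicts the gold label.

    gold: list of existing annotation labels, one per item.
    model_preds: dict mapping a model name to its list of predicted labels.
    min_agreement: hypothetical threshold on the share of models that must
        back the majority label before an item is flagged.
    """
    flagged = []
    n_models = len(model_preds)
    for i, gold_label in enumerate(gold):
        # Count how many models voted for each label on item i.
        votes = Counter(preds[i] for preds in model_preds.values())
        top_label, top_count = votes.most_common(1)[0]
        agreement = top_count / n_models
        # A contested candidate: the models agree, but not with the annotation.
        if top_label != gold_label and agreement >= min_agreement:
            flagged.append((i, gold_label, top_label, agreement))
    return flagged


# Toy data; labels and model names are placeholders, not taken from the paper.
gold = ["framed", "not_framed", "framed", "not_framed"]
model_preds = {
    "model_a": ["framed", "framed", "framed", "not_framed"],
    "model_b": ["framed", "framed", "not_framed", "not_framed"],
    "model_c": ["framed", "framed", "framed", "not_framed"],
    "model_d": ["framed", "not_framed", "framed", "not_framed"],
}

contested = flag_contested(gold, model_preds)
for index, gold_label, consensus_label, agreement in contested:
    print(f"item {index}: gold={gold_label}, consensus={consensus_label}, "
          f"agreement={agreement:.2f}")
```

In a sketch like this, flagged items are candidates for human re-inspection rather than automatic relabelling, since model agreement may also reflect a shared bias such as conflating emotional language with framing.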