News headlines often evoke sentiment by intentionally portraying entities in particular ways, making targeted sentiment analysis (TSA) of headlines a worthwhile but difficult task. Due to its subjectivity, creating TSA datasets can involve various annotation paradigms, from descriptive to prescriptive, either encouraging or limiting subjectivity. LLMs are a good fit for TSA due to their broad linguistic and world knowledge and in-context learning abilities, yet their performance depends on prompt design. In this paper, we compare the accuracy of state-of-the-art LLMs and fine-tuned encoder models for TSA of news headlines using descriptive and prescriptive datasets across several languages. Exploring the descriptive--prescriptive continuum, we analyze how performance is affected by prompt prescriptiveness, ranging from plain zero-shot to elaborate few-shot prompts. Finally, we evaluate the ability of LLMs to quantify uncertainty via calibration error and comparison to human label variation. We find that LLMs outperform fine-tuned encoders on descriptive datasets, while calibration and F1-score generally improve with increased prescriptiveness, yet the optimal level varies.