Framing is an essential device in news reporting, allowing the writer to influence public perceptions of current affairs. While automatic news framing detection datasets exist in various languages, none focuses on news framing in the Chinese language, which has complex character meanings and unique linguistic features. This study introduces the first Chinese News Framing dataset, which can be used either as a stand-alone dataset or as a supplementary resource to the SemEval-2023 task 3 dataset. We detail its creation and run baseline experiments to highlight the need for such a dataset and to establish benchmarks for future research, reporting results obtained by fine-tuning XLM-RoBERTa-Base and by using GPT-4o in the zero-shot setting. We find that GPT-4o performs significantly worse than fine-tuned XLM-RoBERTa across all languages. For the Chinese language, we obtain an F1-micro score (the performance metric for SemEval task 3, subtask 2) of 0.719 using only samples from our Chinese News Framing dataset and of 0.753 when we augment the SemEval dataset with Chinese news framing samples. Given these positive news frame detection results, the dataset is a valuable resource for detecting news frames in the Chinese language and a useful supplement to the SemEval-2023 task 3 dataset.
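To make the reported metric concrete, the following is a minimal sketch of how a micro-averaged F1 score can be computed, assuming a multi-label setting in which each article may carry several frames. The function name `micro_f1`, the frame labels, and the toy predictions are illustrative assumptions, not taken from the dataset or from the official SemEval scorer.

```python
from typing import Sequence, Set


def micro_f1(gold: Sequence[Set[str]], pred: Sequence[Set[str]]) -> float:
    """Micro-averaged F1 over multi-label predictions.

    Counts true positives, false positives, and false negatives
    across all articles and labels, then computes a single F1.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of articles")

    tp = fp = fn = 0
    for gold_labels, pred_labels in zip(gold, pred):
        tp += len(gold_labels & pred_labels)  # frames correctly predicted
        fp += len(pred_labels - gold_labels)  # frames predicted but not annotated
        fn += len(gold_labels - pred_labels)  # frames annotated but missed

    denominator = 2 * tp + fp + fn
    # Convention assumed here: return 0.0 when there are no labels at all.
    return 2 * tp / denominator if denominator else 0.0


# Toy example with hypothetical frame labels (not real annotations).
gold = [{"Morality", "Economic"}, {"Health_and_safety"}]
pred = [{"Morality"}, {"Health_and_safety", "Economic"}]
score = micro_f1(gold, pred)  # tp=2, fp=1, fn=1, so F1 = 4 / 6
```

Because the counts are pooled before the score is computed, frequent frames weigh more heavily on the result than rare ones, which is the usual property of micro-averaging.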