Understanding the writing frame of news articles is vital for addressing social issues and has therefore attracted notable attention in the field of communication studies. Yet assessing news article frames remains a challenge because no concrete, unified standard dataset captures the comprehensive nuances within news content. To address this gap, we introduce an extended version of a large labeled news article dataset, adding 16,687 new labeled pairs. By relying on pairwise comparison of news articles, our method removes the need for the manual identification of frame classes that traditional news frame analysis requires. Overall, we introduce the most extensive cross-lingual news article similarity dataset available to date, with 26,555 labeled news article pairs across 10 languages. Each data point has been meticulously annotated according to a codebook detailing eight critical aspects of news content, under a human-in-the-loop framework. Application examples demonstrate its potential for uncovering country communities within global news coverage, exposing media bias among news outlets, and quantifying the factors related to news creation. We envision that this news similarity dataset will broaden our understanding of the media ecosystem in terms of news coverage of events and perspectives across countries, locations, languages, and other social constructs. In doing so, it can catalyze advances in social science research and applied methodologies, thereby exerting a profound impact on our society.