News articles are driven by the informational sources journalists use in reporting. Modeling when, how, and why sources are used together in stories can help us better understand the information we consume and can even help journalists produce it. In this work, we take steps toward this goal by constructing the largest and widest-ranging annotated dataset to date of informational sources used in news writing. We show that our dataset can be used to train high-performing models for information detection and source attribution. We further introduce a novel task, source prediction, to study the compositionality of sources in news articles. We show good performance on this task, which we argue is an important proof of concept for narrative science that explores the internal structure of news articles and aids planning-based language generation, and an important step toward a source-recommendation system to aid journalists.