A wide range of tasks use language models trained on semantic similarity data. While a variety of datasets capture semantic similarity, they are either constructed from modern web data or are relatively small and were created by human annotators in the past decade. This study uses a novel source, newly digitized articles from off-copyright, local U.S. newspapers, to assemble a massive-scale semantic similarity dataset spanning 70 years, from 1920 to 1989, and containing nearly 400M positive semantic similarity pairs. Historically, around half of the articles in U.S. local newspapers came from newswires like the Associated Press. While local papers reproduced articles from the newswire, they wrote their own headlines, which form abstractive summaries of the associated articles. We associate articles with their headlines by exploiting document layouts and language understanding. We then use deep neural methods to detect which articles come from the same underlying source, in the presence of substantial noise and abridgement. The headlines of reproduced articles form positive semantic similarity pairs. The resulting publicly available HEADLINES dataset is significantly larger than most existing semantic similarity datasets and covers a much longer span of time. It will facilitate the application of contrastively trained semantic similarity models to a variety of tasks, including the study of semantic change across space and time.
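The pairing step described above can be illustrated with a minimal sketch. It assumes the articles have already been grouped into clusters that share one underlying newswire source; the cluster identifiers, the example headlines, the function name `headline_pairs`, and the choice to drop exact duplicate headlines are all hypothetical and are not taken from the authors' pipeline.

```python
from itertools import combinations


def headline_pairs(clusters):
    """Yield every unordered pair of distinct headlines within each cluster.

    `clusters` maps a (hypothetical) cluster id to the headlines that
    different local papers wrote for the same reproduced article.
    """
    for headlines in clusters.values():
        # Assumption: identical headlines carry no training signal, so
        # exact duplicates are dropped before pairing.
        unique = sorted(set(headlines))
        yield from combinations(unique, 2)


# Invented example data, for illustration only.
clusters = {
    "wire-story-1": [
        "Senate Passes Farm Bill",
        "Farm Measure Clears Senate",
        "Senate Approves Farm Aid",
    ],
    "wire-story-2": ["Storm Hits Coast"],  # a single headline yields no pairs
}

pairs = list(headline_pairs(clusters))
for a, b in pairs:
    print(f"{a!r} <-> {b!r}")
```

A cluster of n distinct headlines contributes n(n-1)/2 pairs, so the number of positive pairs grows quadratically with how widely an article was reproduced.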