In this resource paper, we present DHPLT, an open collection of diachronic corpora in 41 diverse languages. DHPLT is based on the web-crawled HPLT datasets; we use web crawl timestamps as an approximate signal of document creation time. The collection covers three time periods: 2011-2015, 2020-2021 and 2024-present (1 million documents per time period for each language). We additionally provide pre-computed word type and token embeddings and lexical substitutions for our chosen target words, while leaving other researchers free to select their own target words from the same datasets. DHPLT aims to fill the current gap in multilingual diachronic corpora for semantic change modelling, which so far exist only for about a dozen high-resource languages. It opens the way for a variety of new experimental setups in this field. All the resources described in this paper are available at https://data.hplt-project.org/three/diachronic/, sorted by language.
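To illustrate how a crawl timestamp can stand in for a document's creation time, the following is a minimal sketch of assigning documents to the three periods. It is not the DHPLT pipeline itself: the function name `assign_period`, the ISO 8601 timestamp format, and the treatment of "2024-present" as an open-ended range are assumptions made for illustration only.

```python
from datetime import datetime
from typing import Optional

# Inclusive year ranges for the three periods named in the abstract.
# Treating "2024-present" as open-ended (no upper bound) is an assumption.
PERIODS = {
    "2011-2015": (2011, 2015),
    "2020-2021": (2020, 2021),
    "2024-present": (2024, None),
}


def assign_period(crawl_timestamp: str) -> Optional[str]:
    """Map an ISO 8601 crawl timestamp to a period label.

    Returns None when the timestamp falls in a gap between periods
    (for example 2016-2019), so such documents can be skipped.
    """
    year = datetime.fromisoformat(crawl_timestamp).year
    for label, (start, end) in PERIODS.items():
        if year >= start and (end is None or year <= end):
            return label
    return None


if __name__ == "__main__":
    # Hypothetical timestamps, used only to exercise the function.
    for ts in ("2013-06-01T12:00:00", "2018-03-15T08:30:00", "2024-09-10T00:00:00"):
        print(ts, "->", assign_period(ts))
```

Documents crawled in the gaps between periods map to `None`, which keeps the three periods clearly separated in time.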