We present DaMuEL, a large Multilingual Dataset for Entity Linking containing data in 53 languages. DaMuEL consists of two components. The first is a knowledge base that contains language-agnostic information about entities, including their claims from Wikidata and their named entity types (PER, ORG, LOC, EVENT, BRAND, WORK_OF_ART, MANUFACTURED). The second is a collection of Wikipedia texts with entity mentions linked to the knowledge base, along with language-specific text from Wikidata such as labels, aliases, and descriptions, stored separately for each language. The Wikidata QID serves as a persistent, language-agnostic identifier, so the knowledge base can be combined with the language-specific texts and information for each entity. Because Wikipedia documents deliberately annotate only a single mention of each entity present, we additionally detect, automatically, all mentions of the named entities linked from each document. The dataset contains 27.9M named entities in the knowledge base and 12.3G tokens from Wikipedia texts. It is published under the CC BY-SA license at https://hdl.handle.net/11234/1-5047.
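To illustrate how a QID can tie the language-agnostic knowledge base to the per-language texts, the following is a minimal sketch in Python. The record layouts, field names (`KBEntry`, `LanguageEntry`, `named_entity_types`, and so on), and the helper `join_by_qid` are hypothetical and chosen only for illustration; they do not reflect the actual file format of the released dataset. The sample values are toy data.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class KBEntry:
    """Language-agnostic record (hypothetical layout, not the released format)."""
    qid: str
    named_entity_types: List[str]
    claims: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class LanguageEntry:
    """Language-specific record for one entity (hypothetical layout)."""
    qid: str
    lang: str
    label: str
    aliases: List[str] = field(default_factory=list)
    description: str = ""


def join_by_qid(kb: List[KBEntry], texts: List[LanguageEntry]) -> Dict[str, dict]:
    """Group language-specific records under their knowledge-base entry, keyed by QID."""
    kb_index = {entry.qid: entry for entry in kb}
    merged: Dict[str, dict] = {}
    for text in texts:
        kb_entry = kb_index.get(text.qid)
        if kb_entry is None:
            # Skip language records whose QID is absent from the knowledge base.
            continue
        slot = merged.setdefault(text.qid, {"kb": kb_entry, "texts": {}})
        slot["texts"][text.lang] = text
    return merged


# Toy data for illustration only.
kb = [KBEntry(qid="Q42", named_entity_types=["PER"], claims={"P31": ["Q5"]})]
texts = [
    LanguageEntry(qid="Q42", lang="en", label="Douglas Adams", description="English writer"),
    LanguageEntry(qid="Q42", lang="cs", label="Douglas Adams"),
    LanguageEntry(qid="Q0", lang="en", label="not in the toy knowledge base"),
]

merged = join_by_qid(kb, texts)
for qid, record in merged.items():
    print(qid, record["kb"].named_entity_types, sorted(record["texts"]))
```

Because the QID is the only shared key in this sketch, adding another language only requires more `LanguageEntry` records; the knowledge-base side stays unchanged.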