We present judgeWEL, a dataset for named entity recognition (NER) in Luxembourgish, automatically labelled and subsequently verified with large language models (LLMs) in a novel pipeline. Building datasets for under-represented languages remains one of the major bottlenecks in natural language processing: resource scarcity and linguistic particularities make large-scale annotation costly and potentially inconsistent. To address these challenges, we propose and evaluate an approach that leverages Wikipedia and Wikidata as structured sources of weak supervision. By exploiting internal links within Wikipedia articles, we infer entity types from their corresponding Wikidata entries, generating initial annotations with minimal human intervention. Because such links are not uniformly reliable, we mitigate noise by employing and comparing several LLMs to identify and retain only high-quality labelled sentences. The resulting corpus is approximately five times larger than the currently available Luxembourgish NER dataset and offers broader, more balanced coverage across entity categories, providing a substantial new resource for multilingual and low-resource NER research.
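To make the link-to-label step concrete, the following is a minimal Python sketch, not the authors' exact pipeline: it resolves a Wikipedia internal-link target to its Wikidata item via the public MediaWiki APIs and maps the item's "instance of" (P31) classes to a coarse NER type. The P31-to-label table and the example article title are illustrative assumptions.

```python
# Sketch of the weak-supervision idea: Wikipedia link -> Wikidata item -> NER type.
import requests

WIKI_API = "https://lb.wikipedia.org/w/api.php"      # Luxembourgish Wikipedia
WIKIDATA_API = "https://www.wikidata.org/w/api.php"

# Hypothetical mapping from Wikidata classes to coarse NER labels.
P31_TO_NER = {
    "Q5": "PER",        # human
    "Q515": "LOC",      # city
    "Q6256": "LOC",     # country
    "Q43229": "ORG",    # organization
    "Q4830453": "ORG",  # business
}

def wikidata_id(title: str) -> str | None:
    """Resolve a Wikipedia article title to its Wikidata Q-ID."""
    params = {"action": "query", "prop": "pageprops",
              "titles": title, "format": "json"}
    pages = requests.get(WIKI_API, params=params).json()["query"]["pages"]
    page = next(iter(pages.values()))
    return page.get("pageprops", {}).get("wikibase_item")

def ner_label(qid: str) -> str | None:
    """Map a Wikidata item to a coarse NER type via its P31 claims."""
    params = {"action": "wbgetentities", "ids": qid,
              "props": "claims", "format": "json"}
    entity = requests.get(WIKIDATA_API, params=params).json()["entities"][qid]
    for claim in entity["claims"].get("P31", []):
        snak = claim["mainsnak"]
        if snak.get("snaktype") != "value":   # skip novalue/somevalue snaks
            continue
        cls = snak["datavalue"]["value"]["id"]
        if cls in P31_TO_NER:
            return P31_TO_NER[cls]
    return None  # no mapping: the span stays unlabelled or goes to LLM filtering

qid = wikidata_id("Lëtzebuerg")   # an internal-link target inside an article
print(qid, ner_label(qid))        # e.g. Q32 LOC
```

Sentences whose link-derived labels survive this mapping would then be passed to the LLM judges described above, which retain only the high-quality annotations.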