Do LLMs Judge Distantly Supervised Named Entity Labels Well? Constructing the JudgeWEL Dataset

We present judgeWEL, a dataset for named entity recognition (NER) in Luxembourgish, automatically labelled and subsequently verified using large language models (LLMs) in a novel pipeline. Building datasets for under-represented languages remains one of the major bottlenecks in natural language processing, because scarce resources and linguistic particularities make large-scale annotation costly and potentially inconsistent. To address these challenges, we propose and evaluate an approach that leverages Wikipedia and Wikidata as structured sources of weak supervision. By exploiting internal links within Wikipedia articles, we infer entity types from the corresponding Wikidata entries, thereby generating initial annotations with minimal human intervention. Because such links are not uniformly reliable, we mitigate noise by employing and comparing several LLMs to identify and retain only high-quality labelled sentences. The resulting corpus is approximately five times larger than the currently available Luxembourgish NER dataset and offers broader and more balanced coverage across entity categories, providing a substantial new resource for multilingual and low-resource NER research.
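The pipeline described above (link-based weak labelling followed by a judge filter) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the type mapping `QID_TO_TYPE`, the `Link` structure, the BIO tagging scheme, and the `judge` callable (a stand-in for an LLM call) are all assumptions introduced here, and the abstract does not specify the actual type inventory or prompts.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical mapping from Wikidata "instance of" (P31) values to coarse
# NER types. Illustrative only; the paper's real mapping is not given here.
QID_TO_TYPE: Dict[str, str] = {
    "Q5": "PER",      # human
    "Q515": "LOC",    # city
    "Q6256": "LOC",   # country
    "Q43229": "ORG",  # organization
}


@dataclass
class Link:
    """An internal Wikipedia link, assumed already aligned to tokens."""
    start: int        # index of the first token covered by the link anchor
    end: int          # index one past the last covered token
    instance_of: str  # P31 value of the linked Wikidata item (pre-resolved)


def weak_labels(tokens: List[str], links: List[Link]) -> List[str]:
    """Turn link spans into BIO tags; links with unmapped types stay 'O'."""
    tags = ["O"] * len(tokens)
    for link in links:
        ent_type = QID_TO_TYPE.get(link.instance_of)
        if ent_type is None:
            continue
        tags[link.start] = f"B-{ent_type}"
        for i in range(link.start + 1, link.end):
            tags[i] = f"I-{ent_type}"
    return tags


Sentence = Tuple[List[str], List[str]]          # (tokens, tags)
Judge = Callable[[List[str], List[str]], bool]  # True means "keep"


def filter_with_judge(sentences: List[Sentence], judge: Judge) -> List[Sentence]:
    """Keep only the sentences the judge accepts as well labelled."""
    return [(toks, tags) for toks, tags in sentences if judge(toks, tags)]


if __name__ == "__main__":
    # Toy sentence with made-up link spans, for illustration only.
    tokens = ["Jean", "Muller", "wunnt", "zu", "Lëtzebuerg"]
    links = [Link(0, 2, "Q5"), Link(4, 5, "Q515")]
    labelled = [(tokens, weak_labels(tokens, links))]

    # Placeholder judge: a real one would prompt an LLM with the sentence
    # and its tags, then parse an accept/reject verdict.
    def toy_judge(toks: List[str], tags: List[str]) -> bool:
        return any(tag != "O" for tag in tags)

    print(filter_with_judge(labelled, toy_judge))
```

Passing the judge as a plain callable keeps the filtering step independent of any particular model, which makes it straightforward to compare several LLMs on the same weakly labelled sentences.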

