Recognizing entities in text is a central need in many information-seeking scenarios, and Named Entity Recognition (NER) is arguably one of the most successful examples of a widely adopted NLP task and its corresponding technology. Recent advances in large language models (LLMs) appear to provide effective solutions for NER tasks that were traditionally handled by dedicated models, often matching or surpassing those models. Should NER be considered a solved problem? We argue to the contrary: the capabilities provided by LLMs are not the end of NER research, but rather an exciting beginning. They allow taking NER to the next level, tackling increasingly useful and increasingly challenging variants. We present three variants of the NER task, together with a dataset to support them. The first is a move towards more fine-grained, and intersectional, entity types. The second is a move towards zero-shot recognition and extraction of these fine-grained types based on entity-type labels. The third, and most challenging, is the move from the recognition setup to a novel retrieval setup, where the query is a zero-shot entity type and the expected result is all the sentences in a large, pre-indexed corpus that contain entities of that type, together with the corresponding spans. We show that all three are far from being solved. To facilitate research towards all three goals, we provide a large, silver-annotated corpus of 4 million paragraphs covering 500 entity types.