Named Entity Recognition (NER) remains challenging because of complex entity structures such as nested, overlapping, and discontinuous entities. Existing approaches, such as sequence-to-sequence (Seq2Seq) generation and span-based classification, have shown impressive performance on various NER subtasks, but they are difficult to scale to datasets with longer input text because of either the exposure bias issue or inefficient computation. In this paper, we propose a novel Sequence-to-Forest generation paradigm, S2F-NER, which directly extracts the entities in a sentence via a Forest decoder that decodes multiple entities in parallel rather than sequentially. Specifically, our model generates each path of each tree in the forest autoregressively, where the maximum depth of each tree is three (which is the shortest feasible length for complex NER and is far smaller than the decoding length of Seq2Seq). Based on this paradigm, our model can elegantly mitigate the exposure bias problem while keeping the simplicity of Seq2Seq. Experimental results show that our model significantly outperforms the baselines on three discontinuous NER datasets and on two nested NER datasets, with the largest gains on discontinuous entity recognition.