In this work, we address the NER problem by splitting it into two logical sub-tasks: (1) Span Detection, which simply extracts entity mention spans irrespective of entity type; and (2) Span Classification, which classifies the spans into their entity types. Further, we formulate both sub-tasks as question-answering (QA) problems and produce two leaner models which can be optimized separately for each sub-task. Experiments on four cross-domain datasets demonstrate that this two-step approach is both effective and time-efficient. Our system, SplitNER, outperforms baselines on OntoNotes5.0, WNUT17 and a cybersecurity dataset, and gives on-par performance on BioNLP13CG. In all cases, it achieves a significant reduction in training time compared to its QA baseline counterpart. The effectiveness of our system stems from fine-tuning the BERT model twice, separately for span detection and classification. The source code can be found at https://github.com/c3sr/split-ner.
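The two-step formulation can be illustrated with a minimal sketch. It is not the released implementation: the question templates, the function names, and the rule-based `toy_detect` and `toy_classify` stand-ins are all hypothetical, chosen only to show how a span detector and a span classifier, each posed as a QA problem, compose into one pipeline. In the actual system, both stages would be separately fine-tuned BERT models.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # token indices, end exclusive

# Hypothetical question templates; the wording is an assumption for illustration.
DETECTION_QUESTION = "Extract important entity spans from the text"


def classification_question(tokens: List[str], span: Span) -> str:
    """Build the stage-2 question that asks for the type of one mention."""
    mention = " ".join(tokens[span[0]:span[1]])
    return f"What is {mention}?"


def split_ner(
    tokens: List[str],
    detect: Callable[[str, List[str]], List[Span]],
    classify: Callable[[str, List[str]], str],
) -> List[Tuple[Span, str]]:
    """Stage 1 finds type-agnostic spans; stage 2 labels each span."""
    results = []
    for span in detect(DETECTION_QUESTION, tokens):
        label = classify(classification_question(tokens, span), tokens)
        results.append((span, label))
    return results


def toy_detect(question: str, tokens: List[str]) -> List[Span]:
    """Stand-in for the span detection model: runs of capitalized tokens."""
    spans, start = [], None
    for i, tok in enumerate(tokens + [""]):  # empty sentinel flushes the last run
        is_cap = bool(tok) and tok[0].isupper()
        if is_cap and start is None:
            start = i
        elif not is_cap and start is not None:
            spans.append((start, i))
            start = None
    return spans


def toy_classify(question: str, tokens: List[str]) -> str:
    """Stand-in for the span classification model: a tiny lookup table."""
    mention = question[len("What is "):-1]  # strip the template around the mention
    return {"Emily": "PERSON", "New York": "GPE"}.get(mention, "MISC")


if __name__ == "__main__":
    sentence = ["Emily", "lives", "in", "New", "York"]
    print(split_ner(sentence, toy_detect, toy_classify))
```

Because the two stages only communicate through the list of spans, either stand-in can be replaced or retrained without touching the other, which is the property that allows each sub-task to be optimized separately.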