Fine-grained few-shot entity extraction in the chemical domain faces two unique challenges. First, compared with entity extraction tasks in the general domain, sentences from chemical papers usually contain more entities. Second, entity extraction models usually have difficulty extracting entities of long-tailed types. In this paper, we propose Chem-FINESE, a novel sequence-to-sequence (seq2seq) based few-shot entity extraction approach, to address these two challenges. Chem-FINESE has two components: a seq2seq entity extractor that extracts named entities from the input sentence, and a seq2seq self-validation module that reconstructs the original input sentence from the extracted entities. The self-validation module is motivated by the observation that a good entity extraction system must extract entities faithfully: when the extracted entities are faithful to the input, the original sentence should be largely recoverable from them. In addition, we design a new contrastive loss to reduce excessive copying during extraction. Finally, we release ChemNER+, a new fine-grained chemical entity extraction dataset annotated by domain experts with the ChemNER schema. Experiments in few-shot settings on the ChemNER+ and CHEMET datasets show that our framework achieves absolute F1-score gains of up to 8.26% and 6.84%, respectively.
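To make the two-component data flow concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical linearization in which each entity is written as `mention <type>` and entities are joined with ` ; `; the abstract does not specify the actual target format, and the helper names (`linearize`, `parse`, `training_pairs`, `entity_f1`) as well as the example type labels are illustrative only. The sketch shows how one annotated sentence could yield a training pair for the extractor (sentence to entities) and a reversed pair for the self-validation module (entities to sentence), and how an entity-level F1 score is computed. The contrastive loss is not sketched, since the abstract gives no details of its form.

```python
from typing import Dict, List, Tuple

Entity = Tuple[str, str]  # (mention text, fine-grained type)


def linearize(entities: List[Entity]) -> str:
    """Flatten entities into one target string for a seq2seq extractor.

    The 'mention <type>' layout and the ' ; ' separator are assumptions
    made for this sketch, not the format used in the paper.
    """
    return " ; ".join(f"{mention} <{etype}>" for mention, etype in entities)


def parse(output: str) -> List[Entity]:
    """Recover (mention, type) pairs from a generated string.

    Malformed chunks are skipped; mentions containing ' ; ' are not handled.
    """
    entities: List[Entity] = []
    for chunk in output.split(" ; "):
        chunk = chunk.strip()
        if not chunk.endswith(">") or " <" not in chunk:
            continue
        mention, etype = chunk[:-1].rsplit(" <", 1)
        entities.append((mention.strip(), etype.strip()))
    return entities


def training_pairs(sentence: str, entities: List[Entity]) -> Dict[str, Tuple[str, str]]:
    """Build (source, target) pairs for both hypothetical seq2seq components."""
    target = linearize(entities)
    return {
        "extractor": (sentence, target),        # sentence -> entities
        "self_validation": (target, sentence),  # entities -> sentence
    }


def entity_f1(pred: List[Entity], gold: List[Entity]) -> float:
    """Entity-level F1 over exact (mention, type) matches."""
    pred_set, gold_set = set(pred), set(gold)
    true_pos = len(pred_set & gold_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)
```

In this sketch, an "absolute F1-score gain" would simply be the difference between two `entity_f1` values (for example, a model and its baseline) expressed in percentage points.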