Sharing knowledge across information extraction (IE) tasks has long been difficult because data formats and task definitions vary widely. This divergence wastes information and makes it harder to build complex applications in real-world scenarios. Recent studies often formulate IE tasks as a triplet extraction problem, but that paradigm supports neither multi-span nor n-ary extraction, which limits its versatility. To address this, we reorganize IE problems into unified multi-slot tuples and propose Mirror, a universal framework for various IE tasks. Specifically, we recast existing IE tasks as a multi-span cyclic graph extraction problem and devise a non-autoregressive graph decoding algorithm that extracts all spans in a single step. This graph structure is versatile: it supports not only complex IE tasks but also machine reading comprehension and classification tasks. We manually construct a corpus of 57 datasets for model pretraining and conduct experiments on 30 datasets across 8 downstream tasks. The experimental results show that our model is compatible with these tasks and outperforms or is competitive with state-of-the-art (SOTA) systems under few-shot and zero-shot settings. The code, model weights, and pretraining corpus are available at https://github.com/Spico197/Mirror.
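To make the contrast between triplets and multi-slot tuples concrete, the following is a minimal illustrative sketch and not the Mirror implementation. The names `MultiSlotTuple`, `span_text`, and `as_triplet`, the slot names, and the example sentence are all hypothetical, and "multi-span" is assumed here to mean a span made of several non-adjacent token ranges. The sketch shows that a binary relation fits the triplet form, while an n-ary event and a discontinuous mention do not.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# A fragment is a contiguous token range [start, end). A span may consist of
# several fragments; this is the sketch's assumed reading of "multi-span".
Fragment = Tuple[int, int]
Span = Tuple[Fragment, ...]


@dataclass
class MultiSlotTuple:
    """Hypothetical container: one label plus any number of named slots."""
    label: str
    slots: Dict[str, Span] = field(default_factory=dict)

    def arity(self) -> int:
        """Number of filled slots."""
        return len(self.slots)


def span_text(tokens: List[str], span: Span) -> str:
    """Join the tokens covered by every fragment of a span."""
    return " ".join(" ".join(tokens[s:e]) for s, e in span)


def as_triplet(t: MultiSlotTuple) -> Optional[Tuple[Fragment, str, Fragment]]:
    """Return (head, label, tail) if the tuple fits the triplet form, else None."""
    if t.arity() != 2:
        return None  # n-ary (or unary) tuples have no triplet equivalent
    head, tail = t.slots.values()  # dicts preserve insertion order
    if len(head) != 1 or len(tail) != 1:
        return None  # multi-fragment spans have no triplet equivalent
    return head[0], t.label, tail[0]


tokens = "Alice sold the red and blue cars to Bob in Paris".split()

# Binary relation: expressible as a triplet.
relation = MultiSlotTuple("seller_buyer", {"head": ((0, 1),), "tail": ((8, 9),)})

# N-ary event: four slots, so it cannot be reduced to a single triplet.
event = MultiSlotTuple(
    "Transaction",
    {"trigger": ((1, 2),), "seller": ((0, 1),), "buyer": ((8, 9),), "place": ((10, 11),)},
)

# Discontinuous mention: one slot whose span has two fragments ("red" ... "cars").
mention = MultiSlotTuple("Product", {"mention": ((3, 4), (6, 7))})

for item in (relation, event, mention):
    print(item.label, item.arity(), as_triplet(item))
print(span_text(tokens, mention.slots["mention"]))
```

Under these assumptions only `relation` converts to a triplet. The other two need the extra slots or the extra fragments, which is the limitation of the triplet paradigm that the abstract points to.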