Open Information Extraction (OpenIE) is a fundamental yet challenging task in Natural Language Processing that involves extracting all (subject, predicate, object) triples from a given sentence. While labeling-based methods have their merits, generation-based techniques offer unique advantages, such as the ability to generate tokens not present in the original sentence. However, generation-based methods often require a significant amount of training data to learn the OpenIE task format, as well as substantial training time to overcome the slow model convergence caused by the order penalty. In this paper, we introduce a novel framework, OK-IE, that transforms the OpenIE task format into the pre-training task format of the T5 model, thereby reducing the need for extensive training data. Furthermore, we introduce the concept of an Anchor to control the order of model outputs, effectively eliminating the impact of the order penalty on model convergence and significantly reducing training time. Experimental results indicate that, compared to previous SOTA methods, OK-IE requires only 1/100 of the training data (900 instances) and 1/120 of the training time (3 minutes) to achieve comparable results.
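To make the idea concrete, the following is a minimal, purely illustrative sketch of how an OpenIE instance might be cast into T5's span-corruption (sentinel-token) format, with anchor tokens pinning each triple slot to a fixed position. The sentinel names (`<extra_id_0>`, …) are T5's own; the specific anchor markers (`[S]`, `[P]`, `[O]`) and the exact input/output layout are assumptions for illustration, not the paper's verbatim format.

```python
def build_t5_example(sentence, triple):
    """Format a sentence and one (subject, predicate, object) triple as a
    T5-style (source, target) pair using sentinel tokens.

    Anchors ([S], [P], [O]) fix where each slot appears in the output, so
    the model is not penalized for producing slots in a different order.
    """
    subj, pred, obj = triple
    # Source: the sentence plus an anchored template of sentinel slots.
    source = f"{sentence} [S] <extra_id_0> [P] <extra_id_1> [O] <extra_id_2>"
    # Target: each sentinel followed by the span that fills its slot,
    # mirroring T5's span-corruption pre-training objective.
    target = f"<extra_id_0> {subj} <extra_id_1> {pred} <extra_id_2> {obj}"
    return source, target

src, tgt = build_t5_example(
    "Alan Turing was born in London.",
    ("Alan Turing", "was born in", "London"),
)
```

Because the target always begins with `<extra_id_0>` and follows the anchored slot order, the decoding order is fixed by construction rather than learned, which is the intuition behind removing the order penalty.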