Information Extraction (IE) aims to extract structural knowledge (e.g., entities, relations, and events) from natural language text. Task-specific schemas and complex text expressions make this challenging for existing methods. Code, as a typical formalized language, can describe structural knowledge under various schemas in a universal way. Meanwhile, Large Language Models (LLMs) trained on both code and text have demonstrated a strong ability to transform text into code, which offers a feasible solution to IE tasks. In this paper, we therefore propose Code4UIE, a universal retrieval-augmented code generation framework based on LLMs for IE tasks. Specifically, Code4UIE uses Python classes to define the task-specific schemas of various kinds of structural knowledge in a universal way. Extracting knowledge under these schemas then becomes a matter of generating code that instantiates the predefined Python classes with the information in the text. To generate this code more precisely, Code4UIE uses in-context learning to instruct LLMs with examples. To obtain appropriate examples for different tasks, Code4UIE explores several example retrieval strategies that retrieve examples semantically similar to the given text. Extensive experiments on five representative IE tasks across nine datasets demonstrate the effectiveness of the Code4UIE framework.
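The pipeline described in the abstract (a schema written as Python classes, retrieval of similar examples, and a prompt that the LLM completes with instantiation code) might be sketched roughly as follows. This is only an illustration under stated assumptions: the class names `Person`, `Organization`, and `WorkFor`, the example pool, the prompt layout, and all helper functions are hypothetical and not taken from the paper. Bag-of-words cosine similarity stands in for the semantic retrieval strategies, which in practice would likely rely on a sentence encoder. No LLM is called; the sketch stops at building the prompt and shows how a code completion would map back to objects.

```python
import math
import re
from collections import Counter

# Hypothetical schema, kept as source text so it can be placed in a prompt.
SCHEMA = '''
from dataclasses import dataclass

@dataclass
class Person:
    name: str

@dataclass
class Organization:
    name: str

@dataclass
class WorkFor:
    head: Person
    tail: Organization
'''

# Hypothetical example pool: (input text, code that instantiates the schema).
POOL = [
    ("Alice Chen works for Acme Corp.",
     'result = [WorkFor(Person("Alice Chen"), Organization("Acme Corp"))]'),
    ("The river floods every spring.",
     "result = []"),
]


def bag_of_words(text):
    """Lowercased token counts; a crude stand-in for a sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(text, pool, k=1):
    """Return the k pool examples whose text is most similar to the input."""
    query = bag_of_words(text)
    ranked = sorted(pool, key=lambda ex: cosine(query, bag_of_words(ex[0])), reverse=True)
    return ranked[:k]


def build_prompt(schema, demos, text):
    """Schema, then retrieved examples, then the new text left open for completion."""
    parts = [schema.strip()]
    for demo_text, demo_code in demos:
        parts.append(f'text = "{demo_text}"\n{demo_code}')
    parts.append(f'text = "{text}"\nresult =')
    return "\n\n".join(parts)


def run_generated(schema, code):
    """Execute schema plus generated code and return `result`.

    Assumption for this sketch only: `code` is trusted. Real model output
    should be parsed or sandboxed rather than passed to exec().
    """
    namespace = {"__name__": "ie_schema"}
    exec(schema, namespace)
    exec(code, namespace)
    return namespace["result"]
```

For an input such as "Bob Lee works for Initech.", `retrieve` would rank the employment example above the unrelated one, and `build_prompt` would end with an open `result =` line for the model to complete. Passing a completion to `run_generated` turns it back into schema objects, which is the sense in which extraction becomes code generation.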