Text structuralization is an important field of natural language processing (NLP) that consists of information extraction (IE) and structure formalization. However, current studies of text structuralization suffer from a shortage of manually annotated, high-quality datasets across domains and languages, because such annotation requires specialized professional knowledge. In addition, most IE methods are designed for a specific type of structured data, e.g., entities, relations, or events, which makes them hard to generalize to other types. In this work, we propose a simple and efficient approach that instructs large language models (LLMs) to extract a variety of structures from text. More concretely, before feeding the text into an LLM, we add a prefix instruction and a suffix instruction that indicate the desired IE task and the structure type, respectively. Experiments on two LLMs show that this approach enables language models to perform comparably to other state-of-the-art methods on datasets spanning a variety of languages and knowledge domains, and that it generalizes to other IE sub-tasks by changing the content of the instructions. Another benefit of our approach is that it can help researchers build datasets at low cost in low-resource and domain-specific scenarios, e.g., finance and law.
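As an illustration of the prefix and suffix instructions described above, the following is a minimal sketch of how such a prompt might be assembled. The abstract does not give the actual instruction templates, so the function name `build_prompt`, the instruction wording, and the `TASKS` table are all hypothetical; only the overall layout (task prefix, then input text, then structure suffix) follows the description.

```python
# Minimal sketch, not the authors' implementation: all names and
# instruction wordings below are assumptions made for illustration.

def build_prompt(text: str, task: str, structure: str) -> str:
    """Wrap `text` with a prefix naming the IE task and a suffix naming the output structure."""
    prefix = f"Perform the following information extraction task: {task}."
    suffix = f"Return the result as {structure}."
    return f"{prefix}\n\n{text}\n\n{suffix}"


# Hypothetical task table: switching IE sub-tasks only changes the
# instruction strings, not the surrounding code or the model.
TASKS = {
    "ner": ("named entity recognition", "a JSON list of entity and type pairs"),
    "re": ("relation extraction", "a JSON list of head, relation, tail triples"),
    "ee": ("event extraction", "a JSON list of events with trigger and arguments"),
}


def prompt_for(text: str, task_key: str) -> str:
    """Build the prompt for one of the sub-tasks registered in TASKS."""
    task, structure = TASKS[task_key]
    return build_prompt(text, task, structure)


if __name__ == "__main__":
    # The resulting string would be sent to an LLM of choice.
    print(prompt_for("Apple acquired Beats in 2014.", "re"))
```

Under these assumptions, moving from relation extraction to entity or event extraction only requires a different key in `TASKS`, which mirrors the claim that the approach generalizes across IE sub-tasks by changing the instruction content.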