Instruction-based techniques have recently made significant strides in few-shot learning by bridging the gap between pre-trained language models and fine-tuning on specific downstream tasks. Despite these advances, the performance of Large Language Models (LLMs) on information extraction tasks such as Named Entity Recognition (NER), when driven by prompts or instructions, still falls short of supervised baselines. This gap can be attributed to a fundamental mismatch between NER and LLMs. NER is inherently a sequence labeling task, in which the model must assign an entity-type label to each token in a sentence. LLMs, in contrast, are designed for text generation. This mismatch between sequence labeling and text generation leads to subpar performance. In this paper, we recast NER as a text-generation task that LLMs can readily handle. We augment each source sentence with task-specific instructions and answer choices, so that entities and their types are identified in natural language. We harness the strengths of LLMs by combining them with supervised learning. The goal of this combined strategy is to boost the performance of LLMs on extraction tasks such as NER while also addressing the hallucination issues often observed in LLM-generated content. We release Contract NER, a novel corpus that covers seven frequently observed contract categories and contains named entities of 18 distinct legal entity types, along with our baseline models. Our models and dataset are available to the community for future research.
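To make the reformulation concrete, the following is a minimal sketch of how a source sentence might be wrapped with an instruction and answer choices, and how a generated answer might be mapped back to (entity, type) pairs. The prompt wording, the `entity: TYPE` answer format, the function names `build_prompt` and `parse_answer`, and the labels `PARTY` and `EFFECTIVE_DATE` are all assumptions made for illustration; they are not taken from the paper or from the Contract NER label set. The filtering step, which drops any predicted span that does not occur in the sentence or whose type is not among the options, is one simple way to guard against hallucinated entities, not necessarily the mechanism the authors use.

```python
from typing import List, Tuple


def build_prompt(sentence: str, entity_types: List[str]) -> str:
    """Wrap a source sentence with an instruction and answer choices.

    The template below is hypothetical; it only illustrates the idea of
    augmenting the input so that NER becomes a text-generation problem.
    """
    options = ", ".join(entity_types)
    return (
        "Instruction: list every named entity in the sentence with its type, "
        "formatted as 'entity: TYPE' and separated by ';'.\n"
        f"Options: {options}\n"
        f"Sentence: {sentence}\n"
        "Answer:"
    )


def parse_answer(
    answer: str, sentence: str, entity_types: List[str]
) -> List[Tuple[str, str]]:
    """Turn generated text back into (entity, type) pairs.

    Pairs are kept only if the type is one of the offered options and the
    entity text literally appears in the sentence (an assumed, simple
    guard against hallucinated spans).
    """
    allowed = set(entity_types)
    pairs = []
    for chunk in answer.split(";"):
        if ":" not in chunk:
            continue  # skip fragments that do not follow the assumed format
        text, label = chunk.rsplit(":", 1)
        text, label = text.strip(), label.strip()
        if text and label in allowed and text in sentence:
            pairs.append((text, label))
    return pairs


# Hypothetical example: the last predicted entity is not in the sentence.
sentence = "This Agreement is made on 1 January 2020 between Acme Corp and John Doe."
types = ["PARTY", "EFFECTIVE_DATE"]
generated = (
    "Acme Corp: PARTY; John Doe: PARTY; "
    "1 January 2020: EFFECTIVE_DATE; Globex Inc: PARTY"
)
prompt = build_prompt(sentence, types)
entities = parse_answer(generated, sentence, types)
```

In this sketch the string `generated` stands in for model output; in practice it would come from an instruction-tuned LLM that has been fine-tuned on prompt and answer pairs of this shape.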