Large language models (LLMs) have shown impressive ability on open-domain NLP tasks. However, LLMs are sometimes too unconstrained for natural language understanding (NLU) tasks, which typically have restricted input and output formats. Their performance on NLU tasks depends heavily on the prompts or demonstrations provided, and they have been shown to perform poorly on several representative NLU tasks, such as event extraction and entity typing. To this end, we present SeqGPT, a bilingual (i.e., English and Chinese) open-source autoregressive model specially enhanced for open-domain natural language understanding. We express all NLU tasks as two atomic tasks, which define fixed instructions that restrict the input and output format while remaining "open" to arbitrarily varied label sets. The model is first instruction-tuned on extremely fine-grained labeled data synthesized by ChatGPT and then further fine-tuned on 233 different atomic tasks from 152 datasets across various domains. The experimental results show that SeqGPT has decent classification and extraction ability and is capable of performing language understanding tasks on unseen domains. We also conduct empirical studies on the scaling of data and model size as well as on transfer across tasks. Our model is accessible at https://github.com/Alibaba-NLP/SeqGPT.
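To make the idea of fixed instructions with an open label set concrete, the following is a minimal sketch of how such a query might be rendered and its output parsed. The task names (`classify`, `extract`), the prompt template, the `label: item1, item2` output layout, and the helpers `build_prompt` and `parse_output` are all assumptions made for illustration; they are not taken from the SeqGPT release, which may use a different format.

```python
from typing import Dict, List

# Hypothetical names for the two atomic tasks (an assumption, not the released format).
ATOMIC_TASKS = ("classify", "extract")


def build_prompt(text: str, task: str, labels: List[str]) -> str:
    """Render a query with a fixed layout; only the text and the label set vary."""
    if task not in ATOMIC_TASKS:
        raise ValueError(f"unknown atomic task: {task!r}")
    if not labels:
        raise ValueError("label set must not be empty")
    # Illustrative template: input sentence, then task name with its labels.
    return f"Input: {text}\n{task}: {', '.join(labels)}\nOutput:"


def parse_output(output: str, labels: List[str]) -> Dict[str, List[str]]:
    """Parse lines shaped like 'label: item1, item2', ignoring labels outside the set."""
    result: Dict[str, List[str]] = {label: [] for label in labels}
    for line in output.splitlines():
        label, sep, items = line.partition(":")
        label = label.strip()
        if sep and label in result:
            result[label].extend(s.strip() for s in items.split(",") if s.strip())
    return result


if __name__ == "__main__":
    labels = ["person", "organization"]
    print(build_prompt("Alice joined Alibaba in Hangzhou.", "extract", labels))
    print(parse_output("person: Alice\norganization: Alibaba", labels))
```

Because the layout is fixed, the same two functions can serve any label set supplied at query time, which is the sense in which the format stays restricted while the labels stay open.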