High-quality, large-scale instructions are crucial for aligning large language models (LLMs); however, there is a severe shortage of instructions in the field of natural language understanding (NLU). Previous work on constructing NLU instructions mainly focuses on information extraction (IE), neglecting tasks such as machine reading comprehension, question answering, and text classification. Furthermore, the lack of diversity in the data has reduced the generalization ability of trained LLMs on other NLU tasks and caused a noticeable decline in the base model's general capabilities. To address this issue, we propose Hum, a large-scale, high-quality synthetic instruction corpus for NLU tasks, designed to enhance the NLU capabilities of LLMs. Specifically, Hum covers IE (both closed IE and open IE), machine reading comprehension, text classification, and instruction-generalist tasks, thereby enriching task diversity. Additionally, we introduce a human-LLM collaborative mechanism to synthesize instructions, which enriches instruction diversity by incorporating guidelines, preference rules, and format variants. We conduct extensive experiments on 5 NLU tasks and 28 general-capability evaluation datasets for LLMs. Experimental results show that Hum improves the NLU capabilities of six LLMs by an average of 3.1\%, with no significant decline observed in their other general capabilities.