Large Language Models (LLMs) have shown superior performance in a wide range of applications and fields. To achieve better performance in specialized domains such as law and advertisement, LLMs are often continually pre-trained on in-domain data. However, existing approaches suffer from two major issues. First, in-domain data are scarce compared with general, domain-agnostic data. Second, the data used for continual pre-training are not task-aware, so they may not be helpful to downstream applications. We propose TRAIT, a task-oriented in-domain data augmentation framework. The framework consists of two parts: in-domain data selection and task-oriented synthetic passage generation. The data selection strategy identifies and selects a large amount of in-domain data from general corpora, significantly enriching the domain knowledge in the continual pre-training data. The synthetic passages contain guidance on how to use domain knowledge to answer questions about downstream tasks. By training on such passages, the model aligns with the needs of downstream applications. We adapt LLMs to two domains: advertisement and math. On average, TRAIT improves LLM performance by 8% in the advertisement domain and 7.5% in the math domain.