Pretrained large language models (LLMs) such as ChatGPT and Claude have demonstrated strong capabilities across many areas of natural language generation. However, many problems remain when applying LLMs to specialized domains. A common approach to handling downstream tasks with generative AI is to inject new knowledge (e.g., private domain knowledge, cutting-edge information) into a pretrained model through continued training or fine-tuning. Whether a universal paradigm for domain-adaptive training exists, however, remains an open question. In this article, we propose the Information Gain Optimized Tokenizer (IGOT), which analyzes the special token set of downstream tasks, constructs a new subset using a heuristic function $\phi$ over each special token and its information gain, builds a new domain-specific tokenizer, and continues pretraining on the downstream task data. We explore the many positive effects of this customized tokenizer on domain-adaptive pretraining and verify that it outperforms the ordinary approach of simply collecting data and fine-tuning. In our experiments, continued pretraining with IGOT on LLaMA-7B achieved 11.9\% token savings, 12.2\% training time savings, and 5.8\% savings in maximum GPU VRAM usage; combined with the T5 model, training time savings reached 31.5\%, making it more effective than before to port general generative AI to specific domains. On domain-specific tasks, supervised $IGOT_\tau$ shows strong performance in reducing both the convergence radius and the convergence point during continued pretraining.
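The abstract does not specify the exact form of the heuristic $\phi$, so the following is only a minimal sketch of the general idea: score candidate tokens by an information-gain-style statistic (here, a simple log-likelihood-ratio proxy comparing in-domain vs. general-corpus frequency, which is an assumption, not the paper's definition) and select the top-scoring subset for the domain-specific tokenizer. All names (`information_gain`, `select_domain_tokens`) are hypothetical.

```python
import math
from collections import Counter

def information_gain(token, domain_counts, general_counts):
    """Proxy for a token's information gain on the domain corpus:
    how much more probable it is in-domain than in general text.
    This is an illustrative stand-in for the paper's statistic."""
    d_total = sum(domain_counts.values())
    g_total = sum(general_counts.values())
    p_domain = domain_counts.get(token, 0) / d_total
    # Add-one smoothing so unseen general-corpus tokens don't divide by zero.
    p_general = (general_counts.get(token, 0) + 1) / (g_total + len(general_counts))
    if p_domain == 0:
        return 0.0
    return p_domain * math.log(p_domain / p_general)

def select_domain_tokens(domain_counts, general_counts,
                         phi=lambda token, gain: gain, top_k=100):
    """Rank domain tokens by phi(token, information gain) and keep the
    top_k with positive score; phi defaults to the identity on the gain."""
    scored = [(t, phi(t, information_gain(t, domain_counts, general_counts)))
              for t in domain_counts]
    scored.sort(key=lambda ts: ts[1], reverse=True)
    return [t for t, s in scored[:top_k] if s > 0]

# Toy usage: a domain-heavy term outranks a common function word.
domain_counts = Counter({"transformer": 50, "the": 100})
general_counts = Counter({"the": 1000, "cat": 10})
print(select_domain_tokens(domain_counts, general_counts, top_k=5))
```

Tokens selected this way would then be added to the base tokenizer's vocabulary before continued pretraining; the savings reported above come from such domain terms being encoded as single tokens rather than long subword sequences.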