We conducted extensive experiments on domain adaptation of the Meta-Llama-3-70B-Instruct model using SEC data, evaluating its performance on both general and domain-specific benchmarks. Our focus was on continual pre-training (CPT) and model merging, with the aim of enhancing the model's domain-specific capabilities while mitigating catastrophic forgetting. In this study, we evaluated the impact of integrating financial regulatory data into a robust language model and examined how effectively our model merging techniques preserve and improve the model's instruction-following abilities. The model is available on Hugging Face as arcee-ai/Llama-3-SEC-Base: https://huggingface.co/arcee-ai/Llama-3-SEC-Base. This release is an intermediate checkpoint of our final model and has been trained on 20B tokens so far; the full model is still training. This preprint technical report provides thorough evaluations to help understand the entire process.