This study investigates how to efficiently build a domain-specialized large language model (LLM) for statistics using the lightweight LLaMA-3.2-3B family as the foundation model (FM). We systematically compare three multi-stage training pipelines, starting respectively from (i) a base FM with no instruction-following capability, (ii) a base FM augmented with post-hoc instruction tuning, and (iii) an instruction-tuned FM with strong general reasoning abilities, across continual pretraining, supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF) preference alignment, and downstream task fine-tuning (DTFT). Results show that pipelines beginning with a base FM fail to develop meaningful statistical reasoning, even after extensive instruction tuning, SFT, or RLHF alignment. In contrast, starting from LLaMA-3.2-3B-Instruct enables effective domain specialization. A comprehensive evaluation of SFT variants reveals clear trade-offs between domain expertise and general reasoning ability. We further demonstrate that direct preference optimization provides stable and effective RLHF preference alignment. Finally, we show that DTFT must be performed at extremely low intensity to avoid catastrophic forgetting in highly optimized models. The final model, StatLLaMA, achieves strong and balanced performance on benchmarks of mathematical reasoning, common-sense reasoning, and statistical expertise, offering a practical blueprint for developing resource-efficient statistical LLMs. The code is available at https://github.com/HuangDLab/StatLLaMA.