Large language models (LLMs) have shown excellent mastery of human language, but still struggle in real-world applications that require mathematical problem-solving. While many strategies and datasets have been developed to enhance LLMs' mathematical abilities, it remains a challenge to simultaneously maintain and improve both language and mathematical capabilities in deployed LLM systems. In this work, we tailor the Self-Critique pipeline, which addresses this challenge in the feedback learning stage of LLM alignment. We first train a general Math-Critique model from the LLM itself to provide feedback signals. Then, we sequentially employ rejective fine-tuning and direct preference optimization over the LLM's own generations for data collection. Based on ChatGLM3-32B, we conduct a series of experiments on both academic datasets and our newly created challenging dataset, MathUserEval. Results show that our pipeline significantly enhances the LLM's mathematical problem-solving while still improving its language ability, outperforming LLMs that may be twice as large. Related techniques have been deployed to ChatGLM\footnote{\url{https://chatglm.cn}}, an LLM served online. The related evaluation dataset and scripts are released at \url{https://github.com/THUDM/ChatGLM-Math}.
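The two data-collection stages named in the abstract (rejective fine-tuning, then direct preference optimization, both driven by critique feedback on the model's own generations) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function names, the score scale, the `threshold` and `min_gap` parameters, the best-versus-worst pairing rule, and the `toy_critique` scorer, which stands in for a trained Math-Critique model.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical type: a critique model maps (question, response) to a score.
Critique = Callable[[str, str], float]


def rejection_sample(generations: Dict[str, List[str]], critique: Critique,
                     threshold: float) -> List[Tuple[str, str]]:
    """Keep (question, response) pairs whose critique score meets the threshold."""
    kept = []
    for question, responses in generations.items():
        for response in responses:
            if critique(question, response) >= threshold:
                kept.append((question, response))
    return kept


def preference_pairs(generations: Dict[str, List[str]], critique: Critique,
                     min_gap: float) -> List[Tuple[str, str, str]]:
    """Build (question, chosen, rejected) triples from the best- and
    worst-scored responses, skipping questions with too small a score gap."""
    pairs = []
    for question, responses in generations.items():
        if len(responses) < 2:
            continue
        scored = sorted(responses, key=lambda r: critique(question, r))
        worst, best = scored[0], scored[-1]
        if critique(question, best) - critique(question, worst) >= min_gap:
            pairs.append((question, best, worst))
    return pairs


def toy_critique(question: str, response: str) -> float:
    """Stand-in scorer for illustration only (assumed 1-10 scale):
    10 if the response ends with the reference answer, else 2."""
    answers = {"2 + 3 = ?": "5", "7 * 6 = ?": "42"}
    return 10.0 if response.strip().endswith(answers[question]) else 2.0


# Toy generations sampled from a hypothetical model, several per question.
GENERATIONS = {
    "2 + 3 = ?": ["The sum is 5", "The sum is 6"],
    "7 * 6 = ?": ["The product is 42", "I get 42"],
}

# Stage 1: data for rejective fine-tuning (high-scoring responses only).
sft_data = rejection_sample(GENERATIONS, toy_critique, threshold=8.0)

# Stage 2: preference pairs for direct preference optimization.
dpo_data = preference_pairs(GENERATIONS, toy_critique, min_gap=4.0)
```

In this toy run the first stage keeps only responses the scorer rates highly, while the second stage yields a pair only for the question where the sampled responses disagree in quality. A question whose responses all score the same contributes no preference signal under this pairing rule.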