Bridging the significant gap between large language models' English and non-English performance remains a major challenge. While some prior studies attempt to narrow this gap with translated training data, the recently proposed question alignment approach leverages the model's English expertise to improve multilingual performance with minimal use of expensive, error-prone translation. In this paper, we explore how broadly this method can be applied by examining its effects on reasoning with executable code and reasoning with common sense. We also explore how to apply this approach efficiently to extremely large language models via proxy-tuning. Experimental results on the multilingual reasoning benchmarks mGSM, mSVAMP, and xCSQA demonstrate that question alignment boosts multilingual performance across diverse reasoning scenarios, model families, and sizes. For instance, when applied to the LLaMA2 models, our method brings an average accuracy improvement of 12.2% on mGSM, even for the 70B model. To understand the mechanism of its success, we analyze the representation space, chain-of-thought outputs, and translation data scales, which reveals how question translation training strengthens language alignment within LLMs and shapes their working patterns.
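As a rough illustration of the proxy-tuning idea mentioned above, the sketch below shows the standard decoding-time recipe (Liu et al., 2024): the logits of a large untuned model are shifted by the difference between a small tuned proxy and its untuned counterpart, so the large model behaves as if it had been fine-tuned. This is a minimal sketch, not the paper's exact implementation; the checkpoint path `path/to/7b-question-aligned` and the prompt are placeholders, and memory management (dtype, device placement) is omitted for brevity.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoints for illustration; the small models must share
# the large model's vocabulary, as the LLaMA2 family does.
BASE_LARGE = "meta-llama/Llama-2-70b-hf"    # large untuned base model
BASE_SMALL = "meta-llama/Llama-2-7b-hf"     # small untuned proxy
TUNED_SMALL = "path/to/7b-question-aligned"  # small proxy after question alignment

tok = AutoTokenizer.from_pretrained(BASE_LARGE)
large = AutoModelForCausalLM.from_pretrained(BASE_LARGE)
small_base = AutoModelForCausalLM.from_pretrained(BASE_SMALL)
small_tuned = AutoModelForCausalLM.from_pretrained(TUNED_SMALL)

@torch.no_grad()
def proxy_tuned_step(input_ids):
    """One greedy decoding step with proxy-tuned logits."""
    logits_large = large(input_ids).logits[:, -1, :]
    logits_base = small_base(input_ids).logits[:, -1, :]
    logits_tuned = small_tuned(input_ids).logits[:, -1, :]
    # Shift the large model's next-token distribution by the tuning
    # delta learned on the small proxy pair.
    logits = logits_large + (logits_tuned - logits_base)
    return logits.argmax(dim=-1, keepdim=True)

# Placeholder multilingual reasoning prompt.
prompt = "Please solve the following question step by step:\n..."
ids = tok(prompt, return_tensors="pt").input_ids
for _ in range(256):
    next_id = proxy_tuned_step(ids)
    ids = torch.cat([ids, next_id], dim=-1)
    if next_id.item() == tok.eos_token_id:
        break
print(tok.decode(ids[0]))
```

Because the logit arithmetic happens only at decoding time, this lets the expensive question alignment fine-tuning run on the 7B proxy while the 70B model is used untouched, which is what makes the approach efficient at extreme scale.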