Large Reasoning Models (LRMs) achieve strong performance on mathematical, scientific, and other question-answering tasks, but their multilingual reasoning abilities remain underexplored. When presented with non-English questions, LRMs often default to reasoning in English, raising concerns about interpretability and about how linguistic and cultural nuances are handled. We systematically compare an LRM's reasoning in English with its reasoning in the language of the question. Our evaluation spans two tasks: MGSM and GPQA Diamond. Beyond measuring answer accuracy, we also analyze cognitive attributes in the reasoning traces. We find that English reasoning traces exhibit substantially more of these cognitive attributes, and that reasoning in English generally yields higher final-answer accuracy, with the performance gap widening as tasks become more complex. However, this English-centric strategy is susceptible to a key failure mode: getting "Lost in Translation," where translation steps introduce errors that reasoning directly in the question's language would have avoided.