Reasoning language models (RLMs) achieve strong performance on complex reasoning tasks, yet they still exhibit a multilingual reasoning gap, performing better in high-resource languages than in low-resource ones. While recent efforts have been made to address this gap, its underlying causes remain largely unexplored. In this work, we show that the gap primarily stems from failures in language understanding: specifically, the model's inability to translate multilingual inputs into the language that dominates its reasoning traces (typically English). Since identifying understanding failures can enable targeted mitigation of the gap, we evaluate a range of detection methods and find that understanding failures are detectable to a meaningful extent, with supervised approaches performing best. Building on this, we propose Selective Translation, a strategy that incorporates an English translation into the initial reasoning trace only when an understanding failure is detected. Experimental results with Qwen3-4B show that Selective Translation substantially bridges the multilingual reasoning gap, achieving near full-translation performance while translating only about 20% of inputs. Together, our results show that failures in language understanding are the primary driver of the multilingual reasoning gap and can be detected and selectively mitigated, clarifying the gap's origin and suggesting a path toward more equitable multilingual reasoning. Our code and data are publicly available at https://github.com/deokhk/RLM_analysis.