Large language models demonstrate strong reasoning capabilities through chain-of-thought prompting, but whether this reasoning quality transfers across languages remains underexplored. We introduce a human-validated framework to evaluate whether model-generated reasoning traces logically support their conclusions across languages. Analyzing 65k reasoning traces from GlobalMMLU questions across 6 languages and 6 frontier models, we uncover a critical blind spot: while models achieve high task accuracy, their reasoning can fail to support their conclusions. Reasoning traces in non-Latin scripts show at least twice as much misalignment between reasoning and conclusions as those in Latin scripts. We develop an error taxonomy through human annotation to characterize these failures, finding that they stem primarily from evidential errors (unsupported claims, ambiguous facts), followed by illogical reasoning steps. Our findings demonstrate that current multilingual evaluation practices provide an incomplete picture of model reasoning capabilities and highlight the need for reasoning-aware evaluation frameworks.