Recent frontier models employ long chain-of-thought reasoning to explore solution spaces in context and achieve stronger performance. While many works study distillation to build smaller yet capable models, most focus on English, and little is known about language-specific reasoning. To bridge this gap, we first introduce **Language-Mixed CoT**, a reasoning schema that switches between English and a target language, using English as an anchor to excel in reasoning while minimizing translation artifacts. As a Korean case study, we curate **Yi-Sang**: 5.79M native-Korean prompts from web Q&A, exams, STEM, and code; 3.7M long reasoning traces generated from Qwen3-32B; and a targeted 260k high-yield subset. We train nine models (4B–35B) across six families (Qwen2.5, Llama-3.1, Gemma-3, etc.). Our best model, **KO-REAson-35B**, achieves state-of-the-art performance, with the highest overall average score (64.0 ± 25), ranking first on 5/9 benchmarks and second on the remainder. Smaller and mid-sized models also benefit substantially, with an average improvement of +18.6 points across the nine evaluated benchmarks. Ablations show that **Language-Mixed CoT** is more effective than monolingual CoT and also yields cross-lingual and multimodal performance gains. We release our data-curation pipeline, evaluation system, datasets, and models to advance research on language-specific reasoning. Data and model collection: https://huggingface.co/KOREAson.