Reasoning-oriented large language models (RLMs) achieve strong gains on tasks such as mathematics and coding by generating explicit intermediate reasoning. However, their impact on machine translation (MT) remains underexplored. We systematically evaluate several open- and closed-weight RLMs on the WMT24++ benchmark and find that enabling explicit reasoning consistently degrades translation quality across languages and models. Analysis reveals that MT reasoning traces are highly linear, lacking revision, self-correction, and exploration of alternative translations, which limits their usefulness. Furthermore, injecting higher-quality reasoning traces from stronger models does not reliably improve weaker models' performance. To address this mismatch, we propose a structured reasoning framework tailored to translation, built on multi-step drafting, adequacy refinement, fluency improvement, and selective iterative revision. We curate a synthetic dataset of dynamic structured reasoning traces and post-train a large reasoning model on this data. Experiments show significant improvements over both standard translation fine-tuning and baselines injected with generic reasoning traces. Our findings demonstrate that reasoning must be task-structured to benefit MT.