Test-time scaling (TTS) has enhanced the performance of reasoning models (RMs) on tasks such as math and coding, yet its efficacy in machine translation (MT) remains underexplored. This paper investigates whether increased inference-time computation improves translation quality. We evaluate 12 RMs across a diverse suite of MT benchmarks spanning multiple domains, examining three scenarios: direct translation, forced-reasoning extrapolation, and post-editing. Our findings show that for general-purpose RMs, TTS provides limited and inconsistent benefits for direct translation, with performance quickly plateauing. However, domain-specific fine-tuning unlocks the effectiveness of TTS by aligning a model's reasoning process with task requirements, yielding consistent improvements up to an optimal, self-determined reasoning depth. We also find that forcing a model to reason beyond its natural stopping point consistently degrades translation quality. In contrast, TTS proves highly effective in a post-editing context, reliably turning self-correction into a beneficial process. These results indicate that the value of inference-time computation in MT lies not in enhancing single-pass translation with general-purpose models, but in targeted applications such as multi-step self-correction workflows and in combination with task-specialized models.