Existing machine translation (MT) metrics and discourse-focused evaluations primarily assess translation quality intrinsically, without measuring the downstream consequences of translation errors. In this work, we focus on extrinsic discourse evaluation of MT under two distinct regimes: static and interactive. Under the static regime, we propose an entity counting task as a probe of referential consistency in discourse. We show that high intrinsic MT quality does not reliably predict downstream discourse success, and that even strong MT systems still produce referential inconsistencies. Under the interactive regime, we study the goal-oriented multi-agent Welfare Diplomacy game as a probe of long-horizon communication and coordination. We find that interaction-specific translation failures impact downstream coordination. Our results highlight goal-oriented environments as a viable framework for discourse-sensitive extrinsic MT evaluation.