Large Language Models (LLMs) have advanced machine translation but remain vulnerable to hallucinations. Unfortunately, existing MT benchmarks fail to expose such failures in multilingual LLMs. To surface hallucinations in multilingual LLMs, we introduce a diagnostic framework with a taxonomy that separates Instruction Detachment from Source Detachment. Guided by this taxonomy, we create HalloMTBench, a multilingual, human-verified benchmark spanning 11 English-to-X directions. We employ four frontier LLMs to generate candidate translations and scrutinize them with an ensemble of LLM judges followed by expert validation, curating 5,435 high-quality instances. We evaluate 17 LLMs on HalloMTBench. Results reveal distinct ``hallucination triggers'': failure patterns reflecting model scale, source-length sensitivity, linguistic biases, and Reinforcement-Learning (RL)-amplified language mixing. HalloMTBench offers a forward-looking testbed for diagnosing LLM translation failures and is available at https://huggingface.co/collections/AIDC-AI/marco-mt.