Large language models (LLMs) have substantially advanced machine translation (MT), yet their effectiveness in translating web novels remains unclear. Existing benchmarks rely on surface-level metrics that fail to capture the distinctive traits of this genre. To address these gaps, we introduce DITING, the first comprehensive evaluation framework for web novel translation. DITING assesses narrative and cultural fidelity across six dimensions: idiom translation, lexical ambiguity, terminology localization, tense consistency, zero-pronoun resolution, and cultural safety, and is supported by over 18K expert-annotated Chinese-English sentence pairs. We further propose AgentEval, a reasoning-driven multi-agent evaluation framework that simulates expert deliberation to assess translation quality beyond lexical overlap; it achieves the highest correlation with human judgments among the seven automatic metrics we test. To enable metric comparison, we develop MetricAlign, a meta-evaluation dataset of 300 sentence pairs annotated with error labels and scalar quality scores. A comprehensive evaluation of fourteen open-source, closed-source, and commercial models reveals that Chinese-trained LLMs surpass larger foreign counterparts, and that DeepSeek-V3 delivers the most faithful and stylistically coherent translations. Our work establishes a new paradigm for exploring LLM-based web novel translation and provides public resources to advance future research.